As you know from your work in STAT 480, the form of your DATA step's INPUT statement depends on how the data values that you are trying to read in are arranged. If all of the data values are standard numeric or character values that are arranged in neatly defined columns as they are here:
Location | Direction | Point | # | # |
---|---|---|---|---|
Dinosaur | NM | West | 2 | 6 |
Ellis Island | NM | East | 1 | 0 |
Everglades | NP | East | 5 | 2 |
Grand Canyon | NP | West | 5 | 3 |
Great Smoky Mountains | NP | East | 3 | 10 |
Hawaii Volcanoes | NP | West | 2 | 2 |
Lava Beds | NM | West | 1 | 1 |
Statue of Liberty | NM | East | 1 | 0 |
Theodore Roosevelt | NP | West | 2 | 2 |
Yellowstone | NP | West | 9 | 11 |
Yosemite | NP | West | 2 | 13 |
then the data values can be read in using column input. If the data values are arranged in neatly defined columns, but contain some nonstandard characters, such as the dollar sign ($) here:
Book Title | Author | Price | # |
---|---|---|---|
The Kissing Hand | Penn, Audrey | $9.95 | 32 |
Voice in the Park | Browne, Anthony | $7.99 | 32 |
Anno's Journey | Anno, Mitsumasa | $7.99 | 48 |
An Egg is Quiet | Aston, Sianna Hutts | $11.55 | 36 |
Flotsam | Wiesner, David | $11.56 | 40 |
The Snowy Day | Keats, Ezra Jack | $6.99 | 40 |
Corduroy | Freeman, Don | $11.55 | 28 |
Snowflake Bentley | Martin, Jacqueline | $10.88 | 32 |
Knuffle Bunny | Williems, Mo | $10.87 | 40 |
Bear Snores on | Wilson, karma | $10.72 | 32 |
Rattletrap Car | Root, Phyllis | $7.99 | 40 |
oliver Finds His Way | Root, Phyllis | $11.24 | 40 |
then the data values must be read in using formatted input. And, if the data values are not arranged in neatly defined columns, but are separated by at least one space and contain no special characters, as they are here:
Barnes 21 4.9000 4.620 14.700 43.000 92.000 32.000 34.000 Williams 21 5.600 4730 15.000 44.000 94.000 32.000 34.000 Jackson 21 6.700 4.650 14.000 41.000 88.000 30.000 34.000 Teel 21 11.100 4.780 13.700 39.000 82.000 29.00 35.000 Smith 21 9.200 4.6013.300 39.000 83.000 29.000 34.000 Washington 21 10.100 4.370 13.700 40.000 91.000 31.000 34.000 Stewart 21 4.600 3.930 13.200 37.00 95.00 34.000 36.000 Taylor 21 6.800 4.620 13.70 40.000 86.000 30.000 35.000 Lindsay 21 6.800 4.620 13.700 40.000 86.000 30.000 35.000 Czekaj 21 51.00 4.210 13.100 39.00 91.000 31.000 34.000
then the data values must be read in using list input.
Just in case you haven't had much of a chance to practice these three styles of input since having taken Stat 480, we'll review them briefly in this section. Then, in the remaining part of this lesson (and the next!), we'll learn how to extend the basic styles of input so that we can read just about any raw data file we want into a SAS data set.
One administrative comment first! We'll look at a number of examples in this lesson and throughout the remainder of the course. For the sake of simplicity, whenever possible, we'll read raw data instream using a DATALINES statement (or the equivalent CARDS statement) rather than from an external raw data file using an INFILE statement. This is the lazy learner's way as it will save us from having to download the data file, save the file on our computers, and edit the INFILE statement to reflect the location of the file. Be aware though that, in practice, nine times out of ten, you'll read data from an external raw data file using an INFILE statement.
Column Input Section
Because column input allows you to read variable values that occupy the same columns within each record, you'll often hear it described as "reading fixed-field data." As you know, to use column input, we simply list the variable names in the INPUT statement, immediately following each variable name with its corresponding column positions in each of the data lines.
Example 20.1
The following program uses column input to read in the values of one character variable (name) and four numeric variables (subj, gender, height, and weight) into a temporary SAS data set called temp:
DATA temp;
input subj 1-4 name $ 6-23 gender 25 height 27-28 weight 30-32;
CARDS;
1024 Alice Smith 1 65 125
1167 Maryann White 1 68 140
1168 Thomas Jones 2 68 190
1201 Benedictine Arnold 2 68 190
1302 Felicia Ho 1 63 115
;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | name | gender | height | weight |
---|---|---|---|---|---|
1 | 1024 | Alice Smith | 1 | 65 | 125 |
2 | 1167 | Maryann White | 1 | 68 | 140 |
3 | 1168 | Thomas Jones | 2 | 68 | 190 |
4 | 1201 | Benedictine Arnold | 2 | 68 | 190 |
5 | 1302 | Felicia Ho | 1 | 63 | 115 |
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.
Here are the important points to recall about column input:
- When using column input, you are not required to indicate missing values with a placeholder, such as a period. That is, missing values can be left blank.
- Column input uses the columns specified to determine the length of character variables, thereby allowing character values to exceed the default 8 characters and to have embedded spaces.
- Column input allows fields to be skipped altogether or to be read in any order.
- Column input allows only part of a value to be read and allows values to be re-read.
- Spaces are not required between the data values.
Now, let's investigate how column input can be modified to accommodate reading in values that contain special characters.
Formatted Input Section
The fundamental difference between column input and formatted input is that column input is only appropriate for reading standard numeric data, while formatted input allows us to read both standard and nonstandard numeric data. That is, formatted input combines the features of column input with the ability to read nonstandard data values.
Example 20.2
The following program uses formatted input to read two standard numeric variables (subj and height), two standard character variables (f_name and l_name), and two nonstandard numeric variables (wt_date and calorie) into a temporary SAS data set called temp:
DATA temp;
input @1 subj 4.
@6 f_name $11.
@18 l_name $6.
+3 height 2.
+5 wt_date mmddyy8.
+1 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC PRINT data = temp;
title 'Output dataset: TEMP';
id subj;
RUN;
subj | f_name | l_name | height | wt_date | calorie |
---|---|---|---|---|---|
1024 | Alice | Smith | 65 | 12/01/95 | 2,036 |
1167 | Maryann | White | 68 | 12/01/95 | 1,800 |
1168 | Thomas | Jones | . | 12/02/95 | 2,302 |
1201 | Benedictine | Arnold | 68 | 11/30/95 | 2,432 |
1302 | Felicia | Ho | 63 | 01/01/96 | 1,972 |
Recall that the @n absolute pointer control tells SAS to move the input pointer to a specific column number n. For example, the INPUT statement tells SAS to move the input pointer to column 18 before starting to read the values for the l_name variable. Recall that the +n relative pointer control tells SAS to move the input pointer forward n columns to a column number that is relative to the current position. For example, the INPUT statement tells SAS, after reading the values for the l_name variable, to move the input pointer 3 positions to the right before starting to read the values for the height variable.
The numeric informat 4. that follows subj in the INPUT statement tells SAS that the values for the numeric variable subj occupy four columns. The character informat $11. that follows f_name in the INPUT statement tells SAS that the values for the character variable f_name occupy eleven columns. And, the special informats mmddyy8. and comma5. that follow wt_date and calorie, respectively, in the INPUT statement tell SAS what to expect when reading in the values for the numeric variables wt_date and calorie.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly. Oh ... you might want to also recall that the PRINT procedure displays the values for wt_date and calorie in a meaningful way only because of the presence of the FORMAT statement in the DATA step. If this doesn't sound familiar, you might want to remove the FORMAT statement and re-run the program to see the less than helpful values that are displayed for the wt_date variable.
List Input Section
If the data values that you are trying to read into a SAS data set are not arranged in neatly defined columns but are separated by at least one space and contain no special characters, then you must use list input. To use list input, you simply place your desired variable names in your INPUT statement in the same order that your data fields appear in your input data file.
Example 20.3
The following program uses list input to read in the subject number, name, gender, weight, and height of five individuals into a temporary SAS data set called temp:
DATA temp;
input subj name $ gender height weight;
CARDS;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine . 68 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT data=temp NOOBS;
title 'Output dataset: TEMP';
RUN;
subj | name | gender | height | weight |
---|---|---|---|---|
1024 | Alice | 1 | 65 | 125 |
1167 | Maryann | 1 | 68 | 140 |
1168 | Thomas | 2 | 68 | 190 |
1201 | Benedict | . | 68 | 190 |
1302 | Felicia | 1 | 63 | 115 |
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.
Although list input might be the easiest input style to use, the ease comes with a bit of a price. In its default form, list input does have several limitations, including:
- Fields must be separated by at least one blank (or other delimiter).
- Fields must be read in order from left to right.
- You cannot skip or re-read fields.
- Missing values must be represented by a placeholder such as a period. (A blank field causes the matching of variable names and values to get out of sync.)
- Character values can't contain embedded blanks (or other delimiters).
- Although the width of a field can be greater than eight columns, both character and numeric variables have a default length of 8 bytes. Character values that are longer than eight characters are truncated.
- Data must be in standard character or numeric format.