20.1 - Three Styles of Input

As you know from your work in STAT 480, the form of your DATA step's INPUT statement depends on how the data values that you are trying to read in are arranged. If all of the data values are standard numeric or character values that are arranged in neatly defined columns as they are here:

LocationDirectionPoint##
DinosaurNMWest26
Ellis IslandNMEast10
EvergladesNPEast52
Grand CanyonNPWest53
Great Smoky MountainsNPEast310
Hawaii VolcanoesNPWest22
Lava BedsNMWest11
Statue of LibertyNMEast10
Theodore RooseveltNPWest22
YellowstoneNPWest911
YosemiteNPWest213

then the data values can be read in using column input. If the data values are arranged in neatly defined columns, but contain some nonstandard characters, such as the dollar sign ($) here:

Book TitleAuthorPrice#
The Kissing HandPenn, Audrey$9.9532
Voice in the ParkBrowne, Anthony$7.9932
Anno's JourneyAnno, Mitsumasa$7.9948
An Egg is QuietAston, Sianna Hutts$11.5536
FlotsamWiesner, David$11.5640
The Snowy DayKeats, Ezra Jack$6.9940
CorduroyFreeman, Don$11.5528
Snowflake BentleyMartin, Jacqueline$10.8832
Knuffle BunnyWilliems, Mo$10.8740
Bear Snores onWilson, karma$10.7232
Rattletrap CarRoot, Phyllis$7.9940
oliver Finds His WayRoot, Phyllis$11.2440

then the data values must be read in using formatted input. And, if the data values are not arranged in neatly defined columns, but are separated by at least one space and contain no special characters, as they are here:

Barnes 21 4.9000 4.620 14.700 43.000 92.000 32.000 34.000
Williams 21 5.600 4730 15.000 44.000 94.000 32.000 34.000
Jackson 21 6.700 4.650 14.000 41.000 88.000 30.000	34.000
Teel 21 11.100 4.780 13.700 39.000 82.000 29.00 35.000
Smith 21 9.200 4.6013.300 39.000 83.000 29.000 34.000
Washington 21 10.100 4.370 13.700 40.000 91.000 31.000 34.000
Stewart 21 4.600 3.930 13.200 37.00 95.00 34.000 36.000
Taylor 21 6.800 4.620 13.70 40.000 86.000 30.000 35.000
Lindsay 21 6.800 4.620 13.700 40.000 86.000 30.000 35.000
Czekaj 21 51.00 4.210 13.100 39.00 91.000 31.000 34.000

then the data values must be read in using list input.

Just in case you haven't had much of a chance to practice these three styles of input since having taken Stat 480, we'll review them briefly in this section. Then, in the remaining part of this lesson (and the next!), we'll learn how to extend the basic styles of input so that we can read just about any raw data file we want into a SAS data set.

One administrative comment first! We'll look at a number of examples in this lesson and throughout the remainder of the course. For the sake of simplicity, whenever possible, we'll read raw data instream using a DATALINES statement (or the equivalent CARDS statement) rather than from an external raw data file using an INFILE statement. This is the lazy learner's way as it will save us from having to download the data file, save the file on our computers, and edit the INFILE statement to reflect the location of the file. Be aware though that, in practice, nine times out of ten, you'll read data from an external raw data file using an INFILE statement.

Column Input Section

Because column input allows you to read variable values that occupy the same columns within each record, you'll often hear it described as "reading fixed-field data." As you know, to use column input, we simply list the variable names in the INPUT statement, immediately following each variable name with its corresponding column positions in each of the data lines.

Example 20.1

The following program uses column input to read in the values of one character variable (name) and four numeric variables (subj, gender, height, and weight) into a temporary SAS data set called temp:

DATA temp;
input subj 1-4 name $ 6-23 gender 25 height 27-28 weight 30-32;
CARDS;
1024 Alice Smith        1 65 125
1167 Maryann White      1 68 140
1168 Thomas Jones       2 68 190
1201 Benedictine Arnold 2 68 190
1302 Felicia Ho         1 63 115
	;
RUN;
 
PROC PRINT data=temp;
	title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
Obssubjnamegenderheightweight
11024Alice Smith165125
21167Maryann White168140
31168Thomas Jones268190
41201Benedictine Arnold268190
51302Felicia Ho163115

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.

Here are the important points to recall about column input:

  • When using column input, you are not required to indicate missing values with a placeholder, such as a period. That is, missing values can be left blank.
  • Column input uses the columns specified to determine the length of character variables, thereby allowing character values to exceed the default 8 characters and to have embedded spaces.
  • Column input allows fields to be skipped altogether or to be read in any order.
  • Column input allows only part of a value to be read and allows values to be re-read.
  • Spaces are not required between the data values.

Now, let's investigate how column input can be modified to accommodate reading in values that contain special characters.

Formatted Input Section

The fundamental difference between column input and formatted input is that column input is only appropriate for reading standard numeric data, while formatted input allows us to read both standard and nonstandard numeric data. That is, formatted input combines the features of column input with the ability to read nonstandard data values.

Example 20.2

The following program uses formatted input to read two standard numeric variables (subj and height), two standard character variables (f_name and l_name), and two nonstandard numeric variables (wt_date and calorie) into a temporary SAS data set called temp:

DATA temp;
  input @1  subj 4. 
		@6  f_name $11. 
		@18 l_name $6.
		+3 height 2. 
		+5 wt_date mmddyy8. 
		+1 calorie comma5.;
	format wt_date mmddyy8. calorie comma5.;
	DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
			;
		RUN;
 
PROC PRINT data = temp;
	title 'Output dataset: TEMP';
	id subj;
RUN;

Output dataset: TEMP
subjf_namel_nameheightwt_datecalorie
1024AliceSmith6512/01/952,036
1167MaryannWhite6812/01/951,800
1168ThomasJones.12/02/952,302
1201BenedictineArnold6811/30/952,432
1302FeliciaHo6301/01/961,972

Recall that the @n absolute pointer control tells SAS to move the input pointer to a specific column number n. For example, the INPUT statement tells SAS to move the input pointer to column 18 before starting to read the values for the l_name variable. Recall that the +n relative pointer control tells SAS to move the input pointer forward n columns to a column number that is relative to the current position. For example, the INPUT statement tells SAS, after reading the values for the l_name variable, to move the input pointer 3 positions to the right before starting to read the values for the height variable.

The numeric informat 4. that follows subj in the INPUT statement tells SAS that the values for the numeric variable subj occupy four columns. The character informat $11. that follows f_name in the INPUT statement tells SAS that the values for the character variable f_name occupy eleven columns. And, the special informats mmddyy8. and comma5. that follow wt_date and calorie, respectively, in the INPUT statement tell SAS what to expect when reading in the values for the numeric variables wt_date and calorie.

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly. Oh ... you might want to also recall that the PRINT procedure displays the values for wt_date and calorie in a meaningful way only because of the presence of the FORMAT statement in the DATA step. If this doesn't sound familiar, you might want to remove the FORMAT statement and re-run the program to see the less than helpful values that are displayed for the wt_date variable.

List Input Section

If the data values that you are trying to read into a SAS data set are not arranged in neatly defined columns but are separated by at least one space and contain no special characters, then you must use list input. To use list input, you simply place your desired variable names in your INPUT statement in the same order that your data fields appear in your input data file.

Example 20.3

The following program uses list input to read in the subject number, name, gender, weight, and height of five individuals into a temporary SAS data set called temp:

DATA temp;
  input subj name $ gender height weight;
  CARDS;
  1024 Alice 1 65 125
  1167 Maryann 1 68 140
  1168 Thomas 2 68 190
  1201 Benedictine . 68 190
  1302 Felicia 1 63 115
  ;
RUN;
 
PROC PRINT data=temp NOOBS;
title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
subjnamegenderheightweight
1024Alice165125
1167Maryann168140
1168Thomas268190
1201Benedict.68190
1302Felicia163115

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.

Although list input might be the easiest input style to use, the ease comes with a bit of a price. In its default form, list input does have several limitations, including:

  • Fields must be separated by at least one blank (or other delimiter).
  • Fields must be read in order from left to right.
  • You cannot skip or re-read fields.
  • Missing values must be represented by a placeholder such as a period. (A blank field causes the matching of variable names and values to get out of sync.)
  • Character values can't contain embedded blanks (or other delimiters).
  • Although the width of a field can be greater than eight columns, both character and numeric variables have a default length of 8 bytes. Character values that are longer than eight characters are truncated.
  • Data must be in standard character or numeric format.