Lesson 20: More on Importing Data -- Part I

Overview

In STAT 480, we learned how to read only the most basic data files into a SAS data set. In this lesson (and the next), we'll extend our knowledge in this area by learning how to read just about any data file into SAS — no matter how messy or unstructured the input data file is. In most cases, the data files will be raw ASCII data files that are obtained from exporting data from some other PC software.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to do the following:

read raw data separated by spaces into a SAS data set (that is, use list input)
read raw data arranged in columns into a SAS data set (that is, use column input)
read raw data not in standard format into a SAS data set (that is, use formatted input)
mix list, column, and formatted input styles to read raw data into a SAS data set
determine when list input, column input, formatted input, or some combination of the three styles should be used to input a raw data file
understand that the lengths of numeric variables are set to 8 by default and therefore do not necessarily coincide with the widths of the numeric informats used in an INPUT statement
state the difference between fixed-length record data files and variable-length record data files
determine when it is appropriate, and how, to use the INFILE statement's PAD option
decide when it is appropriate, and how, to use the INFILE statement's MISSOVER option
determine when it is appropriate, and how, to use the INFILE statement's DLM= option
decide when it is appropriate, and how, to use the INFILE statement's DSD option
determine when it is appropriate, and how, to use the INFILE statement's FIRSTOBS= option
state how to read missing values when using list input
determine when it is appropriate, and how, to specify a range of numeric or character variables in the INPUT statement
utilize the LENGTH statement to modify the length of a character or numeric variable when appropriate
apply the ampersand (&) modifier with list input to read character values that contain embedded blanks
insert the colon (:) modifier with list input to read nonstandard data values and character values that are longer than eight characters, but which have no embedded blanks
explain why with formatted input, the informat determines both the length of character variables and the number of columns that are read
explain why the informat in the modified list input determines only the length of the modified variable, not the number of columns that are read

20.1 - Three Styles of Input

As you know from your work in STAT 480, the form of your DATA step's INPUT statement depends on how the data values that you are trying to read in are arranged. If all of the data values are standard numeric or character values that are arranged in neatly defined columns as they are here:

Location	Direction	Point	#	#
Dinosaur	NM	West	2	6
Ellis Island	NM	East	1	0
Everglades	NP	East	5	2
Grand Canyon	NP	West	5	3
Great Smoky Mountains	NP	East	3	10
Hawaii Volcanoes	NP	West	2	2
Lava Beds	NM	West	1	1
Statue of Liberty	NM	East	1	0
Theodore Roosevelt	NP	West	2	2
Yellowstone	NP	West	9	11
Yosemite	NP	West	2	13

then the data values can be read in using column input. If the data values are arranged in neatly defined columns, but contain some nonstandard characters, such as the dollar sign ($) here:

Book Title	Author	Price	#
The Kissing Hand	Penn, Audrey	$9.95	32
Voice in the Park	Browne, Anthony	$7.99	32
Anno's Journey	Anno, Mitsumasa	$7.99	48
An Egg is Quiet	Aston, Sianna Hutts	$11.55	36
Flotsam	Wiesner, David	$11.56	40
The Snowy Day	Keats, Ezra Jack	$6.99	40
Corduroy	Freeman, Don	$11.55	28
Snowflake Bentley	Martin, Jacqueline	$10.88	32
Knuffle Bunny	Williems, Mo	$10.87	40
Bear Snores on	Wilson, karma	$10.72	32
Rattletrap Car	Root, Phyllis	$7.99	40
oliver Finds His Way	Root, Phyllis	$11.24	40

then the data values must be read in using formatted input. And, if the data values are not arranged in neatly defined columns, but are separated by at least one space and contain no special characters, as they are here:

Barnes 21 4.9000 4.620 14.700 43.000 92.000 32.000 34.000
Williams 21 5.600 4730 15.000 44.000 94.000 32.000 34.000
Jackson 21 6.700 4.650 14.000 41.000 88.000 30.000	34.000
Teel 21 11.100 4.780 13.700 39.000 82.000 29.00 35.000
Smith 21 9.200 4.6013.300 39.000 83.000 29.000 34.000
Washington 21 10.100 4.370 13.700 40.000 91.000 31.000 34.000
Stewart 21 4.600 3.930 13.200 37.00 95.00 34.000 36.000
Taylor 21 6.800 4.620 13.70 40.000 86.000 30.000 35.000
Lindsay 21 6.800 4.620 13.700 40.000 86.000 30.000 35.000
Czekaj 21 51.00 4.210 13.100 39.00 91.000 31.000 34.000

then the data values must be read in using list input.

Just in case you haven't had much of a chance to practice these three styles of input since having taken Stat 480, we'll review them briefly in this section. Then, in the remaining part of this lesson (and the next!), we'll learn how to extend the basic styles of input so that we can read just about any raw data file we want into a SAS data set.

One administrative comment first! We'll look at a number of examples in this lesson and throughout the remainder of the course. For the sake of simplicity, whenever possible, we'll read raw data instream using a DATALINES statement (or the equivalent CARDS statement) rather than from an external raw data file using an INFILE statement. This is the lazy learner's way as it will save us from having to download the data file, save the file on our computers, and edit the INFILE statement to reflect the location of the file. Be aware though that, in practice, nine times out of ten, you'll read data from an external raw data file using an INFILE statement.

Column Input

Because column input allows you to read variable values that occupy the same columns within each record, you'll often hear it described as "reading fixed-field data." As you know, to use column input, we simply list the variable names in the INPUT statement, immediately following each variable name with its corresponding column positions in each of the data lines.

Example 20.1

The following program uses column input to read in the values of one character variable (name) and four numeric variables (subj, gender, height, and weight) into a temporary SAS data set called temp:

DATA temp;
input subj 1-4 name $ 6-23 gender 25 height 27-28 weight 30-32;
CARDS;
1024 Alice Smith        1 65 125
1167 Maryann White      1 68 140
1168 Thomas Jones       2 68 190
1201 Benedictine Arnold 2 68 190
1302 Felicia Ho         1 63 115
	;
RUN;
 
PROC PRINT data=temp;
	title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
Obs	subj	name	gender	height	weight
1	1024	Alice Smith	1	65	125
2	1167	Maryann White	1	68	140
3	1168	Thomas Jones	2	68	190
4	1201	Benedictine Arnold	2	68	190
5	1302	Felicia Ho	1	63	115

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.

Here are the important points to recall about column input:

When using column input, you are not required to indicate missing values with a placeholder, such as a period. That is, missing values can be left blank.
Column input uses the columns specified to determine the length of character variables, thereby allowing character values to exceed the default 8 characters and to have embedded spaces.
Column input allows fields to be skipped altogether or to be read in any order.
Column input allows only part of a value to be read and allows values to be re-read.
Spaces are not required between the data values.

Now, let's investigate how column input can be modified to accommodate reading in values that contain special characters.

Formatted Input

The fundamental difference between column input and formatted input is that column input is only appropriate for reading standard numeric data, while formatted input allows us to read both standard and nonstandard numeric data. That is, formatted input combines the features of column input with the ability to read nonstandard data values.

Example 20.2

The following program uses formatted input to read two standard numeric variables (subj and height), two standard character variables (f_name and l_name), and two nonstandard numeric variables (wt_date and calorie) into a temporary SAS data set called temp:

DATA temp;
  input @1  subj 4. 
		@6  f_name $11. 
		@18 l_name $6.
		+3 height 2. 
		+5 wt_date mmddyy8. 
		+1 calorie comma5.;
	format wt_date mmddyy8. calorie comma5.;
	DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
			;
		RUN;
 
PROC PRINT data = temp;
	title 'Output dataset: TEMP';
	id subj;
RUN;

Output dataset: TEMP
subj	f_name	l_name	height	wt_date	calorie
1024	Alice	Smith	65	12/01/95	2,036
1167	Maryann	White	68	12/01/95	1,800
1168	Thomas	Jones	.	12/02/95	2,302
1201	Benedictine	Arnold	68	11/30/95	2,432
1302	Felicia	Ho	63	01/01/96	1,972

Recall that the @n absolute pointer control tells SAS to move the input pointer to a specific column number n. For example, the INPUT statement tells SAS to move the input pointer to column 18 before starting to read the values for the l_name variable. Recall that the +n relative pointer control tells SAS to move the input pointer forward n columns to a column number that is relative to the current position. For example, the INPUT statement tells SAS, after reading the values for the l_name variable, to move the input pointer 3 positions to the right before starting to read the values for the height variable.

The numeric informat 4. that follows subj in the INPUT statement tells SAS that the values for the numeric variable subj occupy four columns. The character informat $11. that follows f_name in the INPUT statement tells SAS that the values for the character variable f_name occupy eleven columns. And, the special informats mmddyy8. and comma5. that follow wt_date and calorie, respectively, in the INPUT statement tell SAS what to expect when reading in the values for the numeric variables wt_date and calorie.

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly. Oh ... you might want to also recall that the PRINT procedure displays the values for wt_date and calorie in a meaningful way only because of the presence of the FORMAT statement in the DATA step. If this doesn't sound familiar, you might want to remove the FORMAT statement and re-run the program to see the less than helpful values that are displayed for the wt_date variable.

List Input

If the data values that you are trying to read into a SAS data set are not arranged in neatly defined columns but are separated by at least one space and contain no special characters, then you must use list input. To use list input, you simply place your desired variable names in your INPUT statement in the same order that your data fields appear in your input data file.

Example 20.3

The following program uses list input to read in the subject number, name, gender, weight, and height of five individuals into a temporary SAS data set called temp:

DATA temp;
  input subj name $ gender height weight;
  CARDS;
  1024 Alice 1 65 125
  1167 Maryann 1 68 140
  1168 Thomas 2 68 190
  1201 Benedictine . 68 190
  1302 Felicia 1 63 115
  ;
RUN;
 
PROC PRINT data=temp NOOBS;
title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
subj	name	gender	height	weight
1024	Alice	1	65	125
1167	Maryann	1	68	140
1168	Thomas	2	68	190
1201	Benedict	.	68	190
1302	Felicia	1	63	115

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.

Although list input might be the easiest input style to use, the ease comes with a bit of a price. In its default form, list input does have several limitations, including:

Fields must be separated by at least one blank (or other delimiter).
Fields must be read in order from left to right.
You cannot skip or re-read fields.
Missing values must be represented by a placeholder such as a period. (A blank field causes the matching of variable names and values to get out of sync.)
Character values can't contain embedded blanks (or other delimiters).
Although the width of a field can be greater than eight columns, both character and numeric variables have a default length of 8 bytes. Character values that are longer than eight characters are truncated.
Data must be in standard character or numeric format.

20.2 - Issues with Reading Fixed-Field Data

For the most part, using column and formatted input to read in data values that are arranged in neatly defined columns is straightforward. There are though — as always — a few sticking points that we'll address in this section. First, we'll learn not to be surprised when the lengths of our numeric variables do not coincide with the widths of the numeric informats we use in our INPUT statement. Then, we'll investigate what to do when trying to read in fixed-field data from a raw data file that contains records of different lengths.

Lengths of Variables

You might recall that the descriptor portion of a data set contains information about the attributes of each variable in your data set. The length of each variable in your data sets is one such attribute. When you use informats to read in some of your data values, it pays to know how SAS defines the lengths of your variables when it processes your DATA step. Let's make our discussion of this concrete by taking a look at an example.

Example 20.4

The following program is identical to the program in Example 20.2, except the PRINT procedure has been replaced with the CONTENTS procedure so that we can explore how SAS defines the lengths of variables:

DATA temp;
input @1  subj 4. 
		@6  f_name $11. 
		@18 l_name $6.
		+3 height 2. 
		+5 wt_date mmddyy8. 
		+1 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
;
RUN;
 
PROC CONTENTS data = temp;
	title 'Contents of TEMP Data Set';
RUN;

Contents of TEMP Data Set

The CONTENTS Procedure
Data Set Name	WORK.TEMP	Observations	5
Member Type	DATA	Variables	6
Engine	V9	Indexes	0
Created	2023/09/12 16:10:42	Observation Length	56
Last Modified	2023/09/12 16:10:42	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	WINDOWS_64
Encoding	shift-jis Japanese (SJIS)

Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	1167
Obs in First Data Page	5
Number of Data Set Repairs	0
ExtendObsCounter	YES
Filename	C:\Users\klc137\AppData\Local\Temp\SAS Temporary Files\_TD1668_E1-055551_\temp.sas7bdat
Release Created	9.0401M5
Host Created	X64_10PRO
Owner Name	PSU\klc137
File Size	128KB
File Size (bytes)	131072

Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format
6	calorie	Num	8	COMMA5.
2	f_name	Char	11
4	height	Num	8
3	l_name	Char	6
1	subj	Num	8
5	wt_date	Num	8	MMDDYY8.

Launch and run the SAS program, and when reviewing the output of the CONTENTS procedure, direct your attention to this portion:

Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format
6	calorie	num	8	COMMA5.
2	f_name	Char	11
4	height	Num	8
3	l_name	Char	6
1	subj	Num	8
5	wt_date	Nu	8	MMDDYY8.

Let's see what we can learn from this ... SAS defined the length of the character variables — f_name and l_name — as 11 and 6, respectively. Hmm ... is it just a coincidence that the informats we used in the INPUT statement for these two variables were $11.? and $6., respectively? Of course not! In general, during the compile phase, SAS defines the lengths of the character variables in the program data vector to be the same as the field width w specified by the $w. character informat.

Can the same generalization be made for the numeric variables, such as subj, height, wt_date, and calorie? Apparently not! For example, we used the numeric informat of 4. to read in the subj variable, and yet SAS assigned the subj variable a length of 8 bytes. Similarly, we used the comma5. informat to read in the calorie variable, and yet SAS assigned the calorie variable a length of, not 5, but 8 bytes. In general, by default, SAS stores a numeric value— no matter how many digits the value contains — as a floating-point number in 8 bytes of storage. That is, the length of a stored numeric variable is not affected by an informat's width nor by any other column specifications in an INPUT statement. Unfortunately, that doesn't mean you can be lazy and use a default width of 8 for all of your numeric informats. If you do, you'll quickly see that your data values are not read properly.

Example 20.5

The following program is similar to the program in the previous example, except all of the input pointer controls have been changed to @n absolute pointer controls and the numeric informat for the height variable has been changed from 2. to 8.:

DATA temp;
input @1  subj 4. 
		@6  f_name $11. 
		@18 l_name $6.
		@30 height 8. 
		@34 wt_date mmddyy8. 
		@43 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
;
RUN;
 
PROC PRINT data = temp;
	title 'The TEMP Data Set';
RUN;

The TEMP Data Set
Obs	subj	f_name	l_name	height	wt_date	calorie
1	1024	Alice	Smith	.	12/01/95	2,036
2	1167	Maryann	White	.	12/01/95	1,800
3	1168	Thomas	Jones	.	12/02/95	2,302
4	1201	Benedictine	Arnold	.	11/30/95	2,432
5	1302	Felicia	Ho	.	01/01/96	1,972

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that SAS had trouble reading the data values for the height variable. If you take a look at the input data, you can see why. If you start in column 30, where the height values begin, and count eight columns to the right, you should see that SAS encounters two blank spaces — the one between the height and weight values and the one between the weight and date values. Because a blank space is a character value, and the height variable is defined as numeric, SAS gets confused, and displays an Invalid data message in the log window:

NOTE: Invalid data for height in line 299 30-37.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
           1024 Alice       Smith  1 65 125 12/1/95  2,036
      subj=1024 f_name=Alice l_name=Smith height=. wt_date=12/01/95 calorie=2,036 _ERROR_=1 _N_=1
NOTE: Invalid data for height in line 300 30-37.
          1167 Maryann     White  1 68 140 12/01/95 1,800
      subj=1167 f_name=Maryann l_name=White height=. wt_date=12/01/95 calorie=1,800 _ERROR_=1 _N_=2
NOTE: Invalid data for height in line 301 30-37.
          1168 Thomas      Jones  2    190 12/2/95  2,302
      subj=1168 f_name=Thomas l_name=Jones height=. wt_date=12/02/95 calorie=2,302 _ERROR_=1 _N_=3
NOTE: Invalid data for height in line 302 30-37.
          1201 Benedictine Arnold 2 68 190 11/30/95 2,432
      subj=1201 f_name=Benedictine l_name=Arnold height=. wt_date=11/30/95 calorie=2,432 _ERROR_=1 _N_=4
NOTE: Invalid data for height in line 303 30-37.
          1302 Felicia     Ho     1 63 115 1/1/96   1,972
      subj=1302 f_name=Felicia l_name=Ho height=. wt_date=01/01/96 calorie=1,972 _ERROR_=1 _N_=5
NOTE: The data set WORK.TEMP has 5 observations and 6 variables.
NOTE: DATA statement used (Total process time):
          real time           0.01 seconds
          cpu time            0.01 seconds

and assigns missing values to the height variable. Because we used @n absolute pointer controls to tell SAS where to begin reading the wt_date and calorie values, SAS has no trouble reading in the data for those variables.

In a nutshell, the moral of the story of the two previous examples is that you need to make sure that you define the widths of your character and numeric informats to be the number of columns the data values occupy in the input data, not the length of their resulting variables.

Reading Variable-Length Records

When you are using column or formatted input, you might run into trouble when the records in your input data file are of different lengths. Data files that have an end-of-record marker after a fixed, predetermined number of columns are called fixed-length record data files. This data file:

fixed-length record data file example

is a fixed-length record data file, because each record ends in the same column. Reading this data file using column input is a straightforward exercise.

Data files that have an end-of-record marker after the last field in each record are called variable-length record data files. This data file:

variable-length record data file example

is a variable-length record data file, because the length of the records varies. Reading this data file using column input requires using the INFILE statement's PAD option or MISSOVER option to tell SAS how to behave when it reaches an end-of-record marker before all of the variables in a record have been read. Let's investigate this more thoroughly.

First, an important administrative comment! So far, we've been able to get away with illustrating all of the above SAS features by reading the data instream using a DATALINES statement. Since the examples that follow concern options that are available for the INFILE statement, we can't be lazy anymore. For each example, you'll have to first right-click on the link to the raw ASCII data file to save it to a convenient location on your computer, and then edit the program's INFILE statement to reflect that location. When all is said and done, you'll appreciate even more the lazy learner's way!

Example 20.6

The following program attempts to read the addresses.dat raw data file:

variable-length record data file example

using formatted input:

DATA temp;
    infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat';
      input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
    title 'Temp data set';
RUN;

First, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. Then, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Finally, run the SAS program, and review the output. You should see that something has clearly gone awry:

Temp data set
Obs	subj	name	street
1	1024	Alice Smith	1167 Maryann White
2	1168	1201 Benedict A	1302 Felicia Ho

If you take a look at the first record in the data file, you can see how things go downhill from the start. In trying to read the street address of subj 1024, SAS encounters an end-of-record marker before getting to the end of the field at column 45. Therefore, SAS proceeds to do what it has been told to do by default, namely to advance to the next line to look for the data field. That's why it reads "1167 Maryann White" as the street address for subj 1024. Having reached the end of the INPUT statement, SAS then proceeds to the next line and begins to read the data for subj 1168. This time, SAS can't complete reading the columns for name or street before encountering the end-of-record marker. Again, SAS advances to the next line and reads "1201 Benedictine A" as the name for subj 1168. SAS tries to read columns 27-45 as the street address for subj 1168, but again encounters the end-of-record marker before getting to column 45. So, SAS advances to the next line to look for the data field, and in so doing reads "1302 Felicia Ho" as the street address. Having reached the end of the INPUT statement, SAS proceeds to the next line, only to discover that there are no more records to read. The DATA step stops, and we are left with a messed up SAS data set! The following two examples help clean up our mess.

Example 20.7

As you now know, by default, SAS goes to the next data line to read more data if SAS has reached the end of a data line and there are still more variables in the INPUT statement that have not been assigned values. The following program uses the INFILE statement's MISSOVER option to tell SAS not to advance to the next data line to read more values to complete observation, but rather to assign missing values to the remaining variables:

DATA temp;
    infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat' MISSOVER;
      input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
    title 'Temp data set';
RUN;

If you haven't already done so, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. And, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Then, run the SAS program, and review the output. You should see that our resulting data set is much improved:

Temp data set
Obj	subj	name	street
1	1024	Alice Smith
2	1167	Maryann White	756 Washington Ave.
3	1168	Thomas Jones
4	1201	Benedictine Arnold
5	1302	Felicia Ho	110 Jefferson Blvd.

At least this time, SAS associates the data values that it does read in with the correct variables. It's still not a perfect solution though, is it? In some cases, you might instead want SAS to read in the portions of the data fields that are available, such as 123 Elm St. for the street address for subj 1024, rather than automatically assigning a missing value when an end-of-record marker is encountered. That's where the INFILE statement's PAD option proves useful.

Example 20.8

The following program uses the INFILE statement's PAD option to tell SAS to pad each record in the data file with blanks so that all of the data lines have the same length:

DATA temp;
    infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat' PAD;
      input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
    title 'Temp data set';
RUN;

Again, if you haven't already done so, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. And, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Then, run the SAS program, and review the output. You should see that we have achieved the most complete and correct data set yet:

Temp data set
Obs	subj	name	street
1	1024	Alice Smith	123 Elm St.
2	1167	Maryann White	756 Washington Ave.
3	1168	Thomas Jones
4	1201	Benedictine Arnold	245 Ferguson Ave.
5	1302	Felicia Ho	110 Jefferson Blvd.

20.3 - Issues with Reading Free-Format Data

Just as you might encounter situations in which reading data using column and formatted input is not a straightforward exercise, so is the case with list input. In this section, we'll investigate two such situations. First, we'll address what you need to do when your input data file contains some missing values. Then, we'll investigate using the LENGTH statement as a way to tell SAS to set the length of character values to be something other than the default 8 characters. Let's jump right in again!

Reading Missing Values Using List Input

As mentioned in our review of list input, whenever our data contain missing values, we have to make sure that there is a placeholder in place of the missing values ... a period or some other character. If not, when using list input, the blank field causes the matching of variable names and values to get out of sync. You can imagine a situation in which you have a large data file ... let's say 5,000 records and 100 variables ... with no placeholders in place for the data values that are missing. You certainly wouldn't want to have to enter missing value placeholders into the data file just so you could read the data file into a SAS data set. Fortunately, there are ways around this problem! You've already seen one way in the last section, namely the use of the INFILE statement's MISSOVER option. That option will serve you well when using list input too, providing all of your missing values are at the end of a record. The other way, namely the use of the INFILE statement's DSD option, will serve you well when your missing values are at the beginning or middle of a record.

Example 20.9

The following program uses list input and the INFILE statement's MISSOVER option to read in the number of books five children read each week in a library's summer reading program when some of the values are missing at the end of a record:

DATA reading;
	infile DATALINES MISSOVER;
	input Name $ Week1-Week5;
	DATALINES;
Robin 3 2 5 1 6
Jack 4 4 4 3 4
Tim 3 0 0
Martin 1 0 1 1
Caroline 2 3 4 5 6
RUN;
 
PROC PRINT data = reading;
	title 'Summer Reading Program';
	var Name Week1-Week5;
RUN;

Summer Reading Program
Obs	Name	Week1	Week2	Week3	Week4	Week5
1	Robin	3	2	5	1	6
2	Jack	4	4	4	3	4
3	Tim	3	0	0	.	.
4	Martin	1	0	1	1	.
5	Caroline	2	3	4	5	6

First, review the data to note that Tim has no values recorded for the fourth and fifth week and Martin has no values recorded for the fifth week. Also, note that just so you don't have to go through the rigamarole of downloading a data file and editing the INFILE statement, we are back to using the DATALINES statement to read the data instream. But, in order to still be able to take advantage of the INFILE statement's options, we have to tell SAS to expect the data after the DATALINES statement. That's why the:

infile DATALINES ...;

statement.

We need to use the INFILE statement just so we can take advantage of the statement's MISSOVER option. Just as is the case for column and formatted input, the MISSOVER option tells SAS not to advance to the next record when it doesn't find values in the current line for all of the INPUT statement variables. At the end of the current record, values that are expected but not found are set to missing.

One more thing to note is the range of numeric variables:

Week1-Week5

that appears in the INPUT statement. When the variable values in the input raw data file are sequential and separated by a blank (or some other such delimiter), you can specify a range of variables in the INPUT statement, as we did in the above program. This is especially useful if your data contains similar variables. For example, if your data contain the answers to a questionnaire, you might want to call your variables q1, q2, q3, ... . Note that you can also specify a range of variables in the PRINT procedure's VAR statement.

Finally, launch and run the SAS program, and review the output to convince yourself that the data are read in properly.

It is important to keep in mind that the MISSOVER option works only for missing values that occur at the end of the record. If your data contains missing values at the beginning or middle of a record, then you need instead to take advantage of the INFILE statement's DSD option.

Example 20.10

The following program uses list input and the INFILE statement's DSD option to read in the results of a survey of five people when some of the values are missing in the middle of a record:

DATA survey;
	infile DATALINES DLM=',' DSD;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert,,A,C,A,D
William,B,C,A,D,A
Linda,C,B,,A,C
Lisa,D,D,D,C,A
Katherine,A,B,C,D,A
RUN;
 
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
Obs	Name	Q1	Q2	Q3	Q4	Q5
1	Robert		A	C	A	D
2	William	B	C	A	D	A
3	Linda	C	B		A	C
4	Lisa	D	D	D	C	A
5	Katherin	A	B	C	D	A

First, review the data to note that Robert and Linda each have one missing value in the middle of their record. Also, note that the data values are delimited with commas now rather than blanks. That's why the:

DLM=','

option appears in the INFILE statement. You might recall from Stat 480 that DLM stands for delimiter. Then, you should notice the range of variables:

(Q1-Q5) ($)

that appears in the INPUT statement. We've used the proper syntax for specifying a range of character variables, namely both the variable list and the $ sign are enclosed in parentheses.

Hmmm ... what else? Oh yeah ... the whole point of this example ... the DSD option that appears in the INFILE statement. The DSD option changes how SAS treats delimiters when list input is used. Specifically, the DSD option:

sets the default delimiter to a comma
treats two consecutive delimiters as a missing value
removes quotation marks from values

That first point clearly makes our inclusion of the DLM=',' option superfluous, but that's okay. It's the second point that is the most relevant and the one that makes our program here work. Let's check it out! Launch and run the SAS program, and review the output to convince yourself that the data are read in properly. Oops ... not quite! The name Katherine contains nine characters. Because SAS defines the length of character variables to be 8 bytes by default, SAS truncates her name and stores it as Katherin without the ending e. We'll learn how to rectify that problem soon, but first one more example of the DSD option.

Example 20.11

The following program illustrates that the DSD option can also be used i) when there is a missing value at the beginning of a record, and ii) when the data are delimited by blanks (in conjunction with the DLM= option):

DATA survey;
	infile DATALINES DLM=' ' DSD;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert  A C A D
William B C A D A
Linda C B  A C
 D D D C A
Katherine A B C D A
RUN;
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
Obs	Name	Q1	Q2	Q3	Q4	Q5
1	Robert		A	C	A	D
2	William	B	C	A	D	A
3	Linda	C	B		A	C
4		D	D	D	C	A
5	Katherin	A	B	C	D	A

The only difference between this program and the previous one is that the comma delimiters have been replaced with blanks, and the first value (Lisa) in the fourth record has been deleted so that our data set contains a missing value at the beginning of a record. Because the data are now delimited with blanks we use the DLM = ' ' option. Launch and run the SAS program, and review the output to convince yourself that the data are read in properly. Oops ... let's go fix that problem with Katherine now.

Modifying the Length of Character Values

As mentioned previously, when you use list input to read raw data, character values are assigned a default length of 8. When we ran the last two programs, that's why Katherine was truncated to Katherin. The LENGTH statement will help us resolve this problem.

Example 20.12

The following program is identical to the previous program, except a LENGTH statement has been added to tell SAS to define the length of the character variable Name to be 9 rather than the default 8:

DATA survey;
	infile DATALINES DLM=' ' DSD;
	length Name $ 9;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert  A C A D
William B C A D A
Linda C B  A C
 D D D C A
Katherine A B C D A
RUN;
 
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
Obs	Name	Q1	Q2	Q3	Q4	Q5
1	Robert		A	C	A	D
2	William	B	C	A	D	A
3	Linda	C	B		A	C
4		D	D	D	C	A
5	Katherine	A	B	C	D	A

Note that the LENGTH statement was intentionally placed before the INPUT statement. That's because variable attributes are defined when the variable is first encountered in the DATA step. If we placed the LENGTH statement after the INPUT statement, SAS would therefore by default define the length of Name to be 8 characters, not the desired 9. As you can see by the inclusion of the dollar sign ($), the LENGTH statement also tells SAS that Name should be treated as a character variable.

Now, go ahead and launch and run the SAS program, and review the output to convince yourself that all of the data are read in properly this time!

20.4 - Modifying List Input

List input can be made even more versatile by using what is called modified list input. Modified list input entails using either the ampersand (&) modifier or the colon (:) modifier:

The ampersand (&) modifier allows you to read character values that contain embedded blanks.
The colon (:) modifier allows you to read nonstandard data values and character values that are longer than eight characters, but which have no embedded blanks.

Let's take a look at an example in which modified list input would be useful.

Example 20.13

The following program attempts to use list input to read the populations of the ten most populous cities in the United States into a temporary SAS data set called citypops, but the program fails:

DATA reading;
DATA citypops;
   infile DATALINES FIRSTOBS = 2;
   input city pop2000;
   DATALINES;
City  Yr2000Popn
New York  8,008,278
Los Angeles  3,694,820
Chicago  2,896,016
Houston  1,953,631
Philadelphia  1,517,550
Phoenix  1,321,045
San Antonio  1,144,646
San Diego  1,223,400
Dallas  1,188,580
San Jose  894,943
;
RUN;
 
PROC PRINT data = citypops;
   title 'The citypops data set';
RUN;

The citypops data set
Obs	city	pop2000
1	.	.
2	.	.
3	.	.
4	.	.
5	.	.
6	.	.
7	.	.
8	.	.
9	.	.
10	.	.

In reviewing the data, the first thing you might notice is that this particular input data file contains a header row:

City  Yr2000Popn

that reports the content of each record. You may often find yourself in a situation in which someone has handed you such a data file, that is, one containing headings in addition to the columns of data. In general, that's a good thing, since then you know for sure what each record contains. It creates a problem though for reading in the data unless you tell SAS to disregard the heading information. That's just what the FIRSTOBS = 2 option in the INFILE statement tells SAS to do. It tells SAS to begin reading data at line 2 instead of the default line 1.

Moving past the header row, you should note the important features of the data. The longest city name is 12 characters. Some of the cities — New York, for example — contain embedded blanks. There are two blank spaces between the city names and their populations. Finally, because the population values contain commas, they are nonstandard values that require an informat during input. Given this list of features, it shouldn't be surprising that the standard list input style used in the INPUT statement fails.

Launch and run the SAS program, and review the output to convince yourself that SAS encounters a serious problem when attempting to read the data into the citypops data set.

The Ampersand (&) Modifier

Because the ampersand (&) modifier allows us to use list input to read character values containing single embedded blanks, it is the tool that we will want to use to read in the city names.

Example 20.14

The following program uses list input modified with an ampersand (&) to read in the city and the population values of the ten most populous cities in the United States in the year 2000:

DATA citypops;
	infile DATALINES FIRSTOBS = 2;
	length city $ 12;
	input city & pop2000;
	DATALINES;
City  Yr2000Popn
New York  8008278
Los Angeles  3694820
Chicago  2896016
Houston  1953631
Philadelphia  1517550
Phoenix  1321045
San Antonio  1144646
San Diego  1223400
Dallas  1188580
San Jose  894943
;
RUN;
 
PROC PRINT data = citypops;
	title 'The citypops data set';
	format pop2000 comma10.;
RUN;

The citypops data set
Obs	city	pop2000
1	New York	8,008,278
2	Los Angeles	3,694,820
3	Chicago	2,896,016
4	Houston	1,953,631
5	Philadelphia	1,517,550
6	Phoenix	1,321,045
7	San Antonio	1,144,646
8	San Diego	1,223,400
9	Dallas	1,188,580
10	San Jose	894,943

Comparing this program to the previous program you should note four differences:

The LENGTH statement tells SAS, in the compile phase, to define the city variable as a character variable, and to expect the city names to be as long as 12 characters.
The ampersand (&) that follows the city variable in the INPUT statement tells SAS that the city values may contain one or more single embedded blanks. Because the ampersand modifier is used, SAS will read the city value until two or more consecutive blanks are encountered. That is a very important point ... when you use ampersand modified list input, the values that you are reading in must be separated by two or more consecutive blanks. You cannot use any other delimiter to indicate the end of each field.
The commas have been removed from the population values so that SAS can read in the population values using unmodified (standard) list input for the pop2000 variable.
A FORMAT statement has been added to the PRINT procedure just so that the pop2000 values are displayed with commas.

Launch and run the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are read in properly.

Example 20.15

Rather than using a LENGTH statement to define the type and length of the city variable, we can place a $w. character informat right in the INPUT statement. The only difference between the following program and the previous one is that the LENGTH statement has been removed, and the $12. character informat has been inserted into the INPUT statement immediately following the city variable's ampersand (&) modifier:

DATA citypops;
infile DATALINES FIRSTOBS = 2;
input city & $12. pop2000;
DATALINES;
City  Yr2000Popn
New York  8008278
Los Angeles  3694820
Chicago  2896016
Houston  1953631
Philadelphia  1517550
Phoenix  1321045
San Antonio  1144646
San Diego  1223400
Dallas  1188580
San Jose  894943
;
RUN;
 
PROC PRINT data = citypops;
	title 'The citypops data set';
	format pop2000 comma10.;
RUN;

The citypops data set
Obs	city	pop2000
1	New York	8,008,278
2	Los Angeles	3,694,820
3	Chicago	2,896,016
4	Houston	1,953,631
5	Philadelphia	1,517,550
6	Phoenix	1,321,045
7	San Antonio	1,144,646
8	San Diego	1,223,400
9	Dallas	1,188,580
10	San Jose	894,943

Launch and run the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are again read in properly.

The Colon (:) Modifier

The colon (:) modifier allows us to use list input to read nonstandard data values and character values that are longer than eight characters, but which contain no embedded blanks. The colon (:) indicates that values are read until a blank (or other delimiters) is encountered, and then an informat is applied. If an informat for reading character values is specified, the w value specifies the variable's length, overriding the default length of 8.

Example 20.1

The following program uses the colon (:) modifier to tell SAS to expect commas when reading in the values for the pop2000 variable:

DATA citypops;
	infile DATALINES FIRSTOBS = 2;
	input city & $12. pop2000 : comma.;
	DATALINES;
City  Yr2000Popn
New York  8,008,278
Los Angeles  3,694,820
Chicago  2,896,016
Houston  1,953,631
Philadelphia  1,517,550
Phoenix  1,321,045
San Antonio  1,144,646
San Diego  1,223,400
Dallas  1,188,580
San Jose  894,943
;
RUN;

PROC PRINT data = citypops;
	title 'The citypops data set';
	format pop2000 comma10.;
RUN;

The citypops data set
Obs	city	pop2000
1	New York	8,008,278
2	Los Angeles	3,694,820
3	Chicago	2,896,016
4	Houston	1,953,631
5	Philadelphia	1,517,550
6	Phoenix	1,321,045
7	San Antonio	1,144,646
8	San Diego	1,223,400
9	Dallas	1,188,580
10	San Jose	894,943

Comparing this program to the previous program you should note just two differences:

The commas have been added back into the population values so that we can see how to use the colon (:) modifier to read in nonstandard data values while still using list input.
The colon (:) and comma. informat that follows the pop2000 variable in the INPUT statement tells SAS to expect population values to contain nonstandard characters — commas, in this particular instance. As illustrated here, we need not specify a w value when using the COMMAw.d informat here. That's because list input just reads each value until a blank is detected. (This differs from using a numeric informat with formatted input, in which we must specify a w value in order to tell SAS how many columns to read.)

Launch and run the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are again read in properly.

Comparing Formatted Input and Modified List Input

It is important to keep in mind that informats work differently in modified list input than they do in formatted input. So, let's emphasize the point! With formatted input, the informat determines both the length of character variables and the number of columns that are read. The same number of columns are read from each record. For example, the following INPUT statement using formatted input:

input @1 City $12. @15 Pop2000 comma10.;

uses the $12. character informat to tell SAS to set the length of the city variable to 12 as well as to read columns 1 to 12 when reading in these data values:

City	Pop2000
New York	8,008,278
Los Angeles	3,694,820
Chicago	2,896,016
Huston	1,953,631
Philadelphia	1,517,550

The informat in the modified list input, on the other hand, determines only the length of the modified variable, not the number of columns that are read. Here:

input city & $12. pop2000 : comma.;

the raw data values are read until two consecutive blanks are encountered when reading in these data values suitable for list input:

City	Pop2000
New York	8,008,278
Los Angeles	3,694,820
Chicago	2,896,016
Huston	1,953,631
Philadelphia	1,517,550

20.5 - Mixing Input Styles

Of course, the first thing we always do when trying to read data from a raw data file into a SAS data set is to review the data to determine whether we need to use column, formatted, or list input to read in the data values. There are some situations, however, in which just one input method doesn't do the trick. In those cases, we'll want to consider mixing input styles, that is, using more than one input method simultaneously. Let's take a look at an example!

Example 20.17

The following program illustrates using column input, list input, and formatted input simultaneously to read in data concerning five U.S. national parks:

DATA nationalparks;
    	input ParkName $ 1-22 State $ Year @40 Acreage comma9.;
    	DATALINES;
    Yellowstone           ID/MT/WY 1872    4,065,493
    Everglades            FL 1934          1,398,800
    Yosemite              CA 1864            760,917
    Great Smoky Mountains NC/TN 1926         520,269
    Wolf Trap Farm        VA 1966                130;
RUN;
PROC PRINT data = nationalparks;
    format acreage comma9.;
RUN;

The SAS System
Obs	ParkName	State	Year	Acreage
1	Yellowstone	ID/MT/WY	1872	4,065,493
2	Everglades	FL	1934	1,398,800
3	Yosemite	CA	1864	760,917
4	Great Smoky Mountains	NC/TN	1926	520,269
5	Wolf Trap Farm	VA	1966	130

Reviewing the data, you can see that:

Column input is an appropriate method for the first field (ParkName) because the values can be read as standard character values and are arranged in a neatly defined column.
The next two fields are candidates for list input, as they are separated by a single blank while the first (State) contains standard character values and the second (Year) contains standard numeric values.
The values in the last field (Acreage) are arranged in a neatly defined column, but the values require an informat. Therefore, formatted input is an appropriate method.

As you can see by the INPUT statement, mixing the input styles is easily achieved by simply using the different methods within the same INPUT statement. Launch and run the SAS program, and review the output to convince yourself that SAS read in the data properly.

20.6 - Summary

In this lesson, we extended our knowledge of how to read data from various types of data files into SAS data sets.

The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work and can use them in your own SAS programming.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility