Lesson 20: More on Importing Data -- Part I

Lesson 20: More on Importing Data -- Part I

Overview

In STAT 480, we learned how to read only the most basic data files into a SAS data set. In this lesson (and the next), we'll extend our knowledge in this area by learning how to read just about any data file into SAS — no matter how messy or unstructured the input data file is. In most cases, the data files will be raw ASCII data files that are obtained from exporting data from some other PC software.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to do the following:

  • read raw data separated by spaces into a SAS data set (that is, use list input)
  • read raw data arranged in columns into a SAS data set (that is, use column input)
  • read raw data not in standard format into a SAS data set (that is, use formatted input)
  • mix list, column, and formatted input styles to read raw data into a SAS data set
  • determine when list input, column input, formatted input, or some combination of the three styles should be used to input a raw data file
  • understand that the lengths of numeric variables are set to 8 by default and therefore do not necessarily coincide with the widths of the numeric informats used in an INPUT statement
  • state the difference between fixed-length record data files and variable-length record data files
  • determine when it is appropriate, and how, to use the INFILE statement's PAD option
  • decide when it is appropriate, and how, to use the INFILE statement's MISSOVER option
  • determine when it is appropriate, and how, to use the INFILE statement's DLM= option
  • decide when it is appropriate, and how, to use the INFILE statement's DSD option
  • determine when it is appropriate, and how, to use the INFILE statement's FIRSTOBS= option
  • state how to read missing values when using list input
  • determine when it is appropriate, and how, to specify a range of numeric or character variables in the INPUT statement
  • utilize the LENGTH statement to modify the length of a character or numeric variable when appropriate
  • apply the ampersand (&) modifier with list input to read character values that contain embedded blanks
  • insert the colon (:) modifier with list input to read nonstandard data values and character values that are longer than eight characters, but which have no embedded blanks
  • explain why with formatted input, the informat determines both the length of character variables and the number of columns that are read
  • explain why the informat in the modified list input determines only the length of the modified variable, not the number of columns that are read

20.1 - Three Styles of Input

20.1 - Three Styles of Input

As you know from your work in STAT 480, the form of your DATA step's INPUT statement depends on how the data values that you are trying to read in are arranged. If all of the data values are standard numeric or character values that are arranged in neatly defined columns as they are here:

LocationDirectionPoint##
DinosaurNMWest26
Ellis IslandNMEast10
EvergladesNPEast52
Grand CanyonNPWest53
Great Smoky MountainsNPEast310
Hawaii VolcanoesNPWest22
Lava BedsNMWest11
Statue of LibertyNMEast10
Theodore RooseveltNPWest22
YellowstoneNPWest911
YosemiteNPWest213

then the data values can be read in using column input. If the data values are arranged in neatly defined columns, but contain some nonstandard characters, such as the dollar sign ($) here:

Book TitleAuthorPrice#
The Kissing HandPenn, Audrey$9.9532
Voice in the ParkBrowne, Anthony$7.9932
Anno's JourneyAnno, Mitsumasa$7.9948
An Egg is QuietAston, Sianna Hutts$11.5536
FlotsamWiesner, David$11.5640
The Snowy DayKeats, Ezra Jack$6.9940
CorduroyFreeman, Don$11.5528
Snowflake BentleyMartin, Jacqueline$10.8832
Knuffle BunnyWilliems, Mo$10.8740
Bear Snores onWilson, karma$10.7232
Rattletrap CarRoot, Phyllis$7.9940
oliver Finds His WayRoot, Phyllis$11.2440

then the data values must be read in using formatted input. And, if the data values are not arranged in neatly defined columns, but are separated by at least one space and contain no special characters, as they are here:

Barnes 21 4.9000 4.620 14.700 43.000 92.000 32.000 34.000
Williams 21 5.600 4730 15.000 44.000 94.000 32.000 34.000
Jackson 21 6.700 4.650 14.000 41.000 88.000 30.000	34.000
Teel 21 11.100 4.780 13.700 39.000 82.000 29.00 35.000
Smith 21 9.200 4.6013.300 39.000 83.000 29.000 34.000
Washington 21 10.100 4.370 13.700 40.000 91.000 31.000 34.000
Stewart 21 4.600 3.930 13.200 37.00 95.00 34.000 36.000
Taylor 21 6.800 4.620 13.70 40.000 86.000 30.000 35.000
Lindsay 21 6.800 4.620 13.700 40.000 86.000 30.000 35.000
Czekaj 21 51.00 4.210 13.100 39.00 91.000 31.000 34.000

then the data values must be read in using list input.

Just in case you haven't had much of a chance to practice these three styles of input since having taken Stat 480, we'll review them briefly in this section. Then, in the remaining part of this lesson (and the next!), we'll learn how to extend the basic styles of input so that we can read just about any raw data file we want into a SAS data set.

One administrative comment first! We'll look at a number of examples in this lesson and throughout the remainder of the course. For the sake of simplicity, whenever possible, we'll read raw data instream using a DATALINES statement (or the equivalent CARDS statement) rather than from an external raw data file using an INFILE statement. This is the lazy learner's way as it will save us from having to download the data file, save the file on our computers, and edit the INFILE statement to reflect the location of the file. Be aware though that, in practice, nine times out of ten, you'll read data from an external raw data file using an INFILE statement.

Column Input

Because column input allows you to read variable values that occupy the same columns within each record, you'll often hear it described as "reading fixed-field data." As you know, to use column input, we simply list the variable names in the INPUT statement, immediately following each variable name with its corresponding column positions in each of the data lines.

Example 20.1

The following program uses column input to read in the values of one character variable (name) and four numeric variables (subj, gender, height, and weight) into a temporary SAS data set called temp:

DATA temp;
input subj 1-4 name $ 6-23 gender 25 height 27-28 weight 30-32;
CARDS;
1024 Alice Smith        1 65 125
1167 Maryann White      1 68 140
1168 Thomas Jones       2 68 190
1201 Benedictine Arnold 2 68 190
1302 Felicia Ho         1 63 115
	;
RUN;
 
PROC PRINT data=temp;
	title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
Obssubjnamegenderheightweight
11024Alice Smith165125
21167Maryann White168140
31168Thomas Jones268190
41201Benedictine Arnold268190
51302Felicia Ho163115

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.

Here are the important points to recall about column input:

  • When using column input, you are not required to indicate missing values with a placeholder, such as a period. That is, missing values can be left blank.
  • Column input uses the columns specified to determine the length of character variables, thereby allowing character values to exceed the default 8 characters and to have embedded spaces.
  • Column input allows fields to be skipped altogether or to be read in any order.
  • Column input allows only part of a value to be read and allows values to be re-read.
  • Spaces are not required between the data values.

Now, let's investigate how column input can be modified to accommodate reading in values that contain special characters.

Formatted Input

The fundamental difference between column input and formatted input is that column input is only appropriate for reading standard numeric data, while formatted input allows us to read both standard and nonstandard numeric data. That is, formatted input combines the features of column input with the ability to read nonstandard data values.

Example 20.2

The following program uses formatted input to read two standard numeric variables (subj and height), two standard character variables (f_name and l_name), and two nonstandard numeric variables (wt_date and calorie) into a temporary SAS data set called temp:

DATA temp;
  input @1  subj 4. 
		@6  f_name $11. 
		@18 l_name $6.
		+3 height 2. 
		+5 wt_date mmddyy8. 
		+1 calorie comma5.;
	format wt_date mmddyy8. calorie comma5.;
	DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
			;
		RUN;
 
PROC PRINT data = temp;
	title 'Output dataset: TEMP';
	id subj;
RUN;

Output dataset: TEMP
subjf_namel_nameheightwt_datecalorie
1024AliceSmith6512/01/952,036
1167MaryannWhite6812/01/951,800
1168ThomasJones.12/02/952,302
1201BenedictineArnold6811/30/952,432
1302FeliciaHo6301/01/961,972

Recall that the @n absolute pointer control tells SAS to move the input pointer to a specific column number n. For example, the INPUT statement tells SAS to move the input pointer to column 18 before starting to read the values for the l_name variable. Recall that the +n relative pointer control tells SAS to move the input pointer forward n columns to a column number that is relative to the current position. For example, the INPUT statement tells SAS, after reading the values for the l_name variable, to move the input pointer 3 positions to the right before starting to read the values for the height variable.

The numeric informat 4. that follows subj in the INPUT statement tells SAS that the values for the numeric variable subj occupy four columns. The character informat $11. that follows f_name in the INPUT statement tells SAS that the values for the character variable f_name occupy eleven columns. And, the special informats mmddyy8. and comma5. that follow wt_date and calorie, respectively, in the INPUT statement tell SAS what to expect when reading in the values for the numeric variables wt_date and calorie.

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly. Oh ... you might want to also recall that the PRINT procedure displays the values for wt_date and calorie in a meaningful way only because of the presence of the FORMAT statement in the DATA step. If this doesn't sound familiar, you might want to remove the FORMAT statement and re-run the program to see the less than helpful values that are displayed for the wt_date variable.

List Input

If the data values that you are trying to read into a SAS data set are not arranged in neatly defined columns but are separated by at least one space and contain no special characters, then you must use list input. To use list input, you simply place your desired variable names in your INPUT statement in the same order that your data fields appear in your input data file.

Example 20.3

The following program uses list input to read in the subject number, name, gender, weight, and height of five individuals into a temporary SAS data set called temp:

DATA temp;
  input subj name $ gender height weight;
  CARDS;
  1024 Alice 1 65 125
  1167 Maryann 1 68 140
  1168 Thomas 2 68 190
  1201 Benedictine . 68 190
  1302 Felicia 1 63 115
  ;
RUN;
 
PROC PRINT data=temp NOOBS;
title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
subjnamegenderheightweight
1024Alice165125
1167Maryann168140
1168Thomas268190
1201Benedict.68190
1302Felicia163115

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.

Although list input might be the easiest input style to use, the ease comes with a bit of a price. In its default form, list input does have several limitations, including:

  • Fields must be separated by at least one blank (or other delimiter).
  • Fields must be read in order from left to right.
  • You cannot skip or re-read fields.
  • Missing values must be represented by a placeholder such as a period. (A blank field causes the matching of variable names and values to get out of sync.)
  • Character values can't contain embedded blanks (or other delimiters).
  • Although the width of a field can be greater than eight columns, both character and numeric variables have a default length of 8 bytes. Character values that are longer than eight characters are truncated.
  • Data must be in standard character or numeric format.

20.2 - Issues with Reading Fixed-Field Data

20.2 - Issues with Reading Fixed-Field Data

For the most part, using column and formatted input to read in data values that are arranged in neatly defined columns is straightforward. There are though — as always — a few sticking points that we'll address in this section. First, we'll learn not to be surprised when the lengths of our numeric variables do not coincide with the widths of the numeric informats we use in our INPUT statement. Then, we'll investigate what to do when trying to read in fixed-field data from a raw data file that contains records of different lengths.

Lengths of Variables

You might recall that the descriptor portion of a data set contains information about the attributes of each variable in your data set. The length of each variable in your data sets is one such attribute. When you use informats to read in some of your data values, it pays to know how SAS defines the lengths of your variables when it processes your DATA step. Let's make our discussion of this concrete by taking a look at an example.

Example 20.4

The following program is identical to the program in Example 20.2, except the PRINT procedure has been replaced with the CONTENTS procedure so that we can explore how SAS defines the lengths of variables:

DATA temp;
input @1  subj 4. 
		@6  f_name $11. 
		@18 l_name $6.
		+3 height 2. 
		+5 wt_date mmddyy8. 
		+1 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
;
RUN;
 
PROC CONTENTS data = temp;
	title 'Contents of TEMP Data Set';
RUN;

Contents of TEMP Data Set

The CONTENTS Procedure
Data Set NameWORK.TEMPObservations5
Member TypeDATAVariables6
EngineV9Indexes0
Created2023/09/12 16:10:42Observation Length56
Last Modified2023/09/12 16:10:42Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationWINDOWS_64  
Encodingshift-jis Japanese (SJIS)  
Engine/Host Dependent Information
Data Set Page Size65536
Number of Data Set Pages1
First Data Page1
Max Obs per Page1167
Obs in First Data Page5
Number of Data Set Repairs0
ExtendObsCounterYES
FilenameC:\Users\klc137\AppData\Local\Temp\SAS Temporary Files\_TD1668_E1-055551_\temp.sas7bdat
Release Created9.0401M5
Host CreatedX64_10PRO
Owner NamePSU\klc137
File Size128KB
File Size (bytes)131072
Alphabetic List of Variables and Attributes
#VariableTypeLenFormat
6calorieNum8COMMA5.
2f_nameChar11 
4heightNum8 
3l_nameChar6 
1subjNum8 
5wt_dateNum8MMDDYY8.

Launch and run  the SAS program, and when reviewing the output of the CONTENTS procedure, direct your attention to this portion:

Alphabetic List of Variables and Attributes
#VariableTypeLenFormat
6calorienum8COMMA5.
2f_nameChar11 
4heightNum8 
3l_nameChar6 
1subjNum8 
5wt_dateNu8MMDDYY8.

Let's see what we can learn from this ... SAS defined the length of the character variables — f_name and l_name — as 11 and 6, respectively. Hmm ... is it just a coincidence that the informats we used in the INPUT statement for these two variables were $11.? and $6., respectively? Of course not! In general, during the compile phase, SAS defines the lengths of the character variables in the program data vector to be the same as the field width w specified by the $w. character informat.

Can the same generalization be made for the numeric variables, such as subj, height, wt_date, and calorie? Apparently not! For example, we used the numeric informat of 4. to read in the subj variable, and yet SAS assigned the subj variable a length of 8 bytes. Similarly, we used the comma5. informat to read in the calorie variable, and yet SAS assigned the calorie variable a length of, not 5, but 8 bytes. In general, by default, SAS stores a numeric value— no matter how many digits the value contains — as a floating-point number in 8 bytes of storage. That is, the length of a stored numeric variable is not affected by an informat's width nor by any other column specifications in an INPUT statement. Unfortunately, that doesn't mean you can be lazy and use a default width of 8 for all of your numeric informats. If you do, you'll quickly see that your data values are not read properly.

Example 20.5

The following program is similar to the program in the previous example, except all of the input pointer controls have been changed to @n absolute pointer controls and the numeric informat for the height variable has been changed from 2. to 8.:

DATA temp;
input @1  subj 4. 
		@6  f_name $11. 
		@18 l_name $6.
		@30 height 8. 
		@34 wt_date mmddyy8. 
		@43 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
;
RUN;
 
PROC PRINT data = temp;
	title 'The TEMP Data Set';
RUN;

The TEMP Data Set
Obssubjf_namel_nameheightwt_datecalorie
11024AliceSmith.12/01/952,036
21167MaryannWhite.12/01/951,800
31168ThomasJones.12/02/952,302
41201BenedictineArnold.11/30/952,432
51302FeliciaHo.01/01/961,972

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that SAS had trouble reading the data values for the height variable. If you take a look at the input data, you can see why. If you start in column 30, where the height values begin, and count eight columns to the right, you should see that SAS encounters two blank spaces — the one between the height and weight values and the one between the weight and date values. Because a blank space is a character value, and the height variable is defined as numeric, SAS gets confused, and displays an Invalid data message in the log window:

NOTE: Invalid data for height in line 299 30-37.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
           1024 Alice       Smith  1 65 125 12/1/95  2,036
      subj=1024 f_name=Alice l_name=Smith height=. wt_date=12/01/95 calorie=2,036 _ERROR_=1 _N_=1
NOTE: Invalid data for height in line 300 30-37.
          1167 Maryann     White  1 68 140 12/01/95 1,800
      subj=1167 f_name=Maryann l_name=White height=. wt_date=12/01/95 calorie=1,800 _ERROR_=1 _N_=2
NOTE: Invalid data for height in line 301 30-37.
          1168 Thomas      Jones  2    190 12/2/95  2,302
      subj=1168 f_name=Thomas l_name=Jones height=. wt_date=12/02/95 calorie=2,302 _ERROR_=1 _N_=3
NOTE: Invalid data for height in line 302 30-37.
          1201 Benedictine Arnold 2 68 190 11/30/95 2,432
      subj=1201 f_name=Benedictine l_name=Arnold height=. wt_date=11/30/95 calorie=2,432 _ERROR_=1 _N_=4
NOTE: Invalid data for height in line 303 30-37.
          1302 Felicia     Ho     1 63 115 1/1/96   1,972
      subj=1302 f_name=Felicia l_name=Ho height=. wt_date=01/01/96 calorie=1,972 _ERROR_=1 _N_=5
NOTE: The data set WORK.TEMP has 5 observations and 6 variables.
NOTE: DATA statement used (Total process time):
          real time           0.01 seconds
          cpu time            0.01 seconds

and assigns missing values to the height variable. Because we used @n absolute pointer controls to tell SAS where to begin reading the wt_date and calorie values, SAS has no trouble reading in the data for those variables.

In a nutshell, the moral of the story of the two previous examples is that you need to make sure that you define the widths of your character and numeric informats to be the number of columns the data values occupy in the input data, not the length of their resulting variables.

Reading Variable-Length Records

When you are using column or formatted input, you might run into trouble when the records in your input data file are of different lengths. Data files that have an end-of-record marker after a fixed, predetermined number of columns are called fixed-length record data files. This data file:

fixed-length record data file example

is a fixed-length record data file, because each record ends in the same column. Reading this data file using column input is a straightforward exercise.

Data files that have an end-of-record marker after the last field in each record are called variable-length record data files. This data file:

variable-length record data file example

is a variable-length record data file, because the length of the records varies. Reading this data file using column input requires using the INFILE statement's PAD option or MISSOVER option to tell SAS how to behave when it reaches an end-of-record marker before all of the variables in a record have been read. Let's investigate this more thoroughly.

First, an important administrative comment! So far, we've been able to get away with illustrating all of the above SAS features by reading the data instream using a DATALINES statement. Since the examples that follow concern options that are available for the INFILE statement, we can't be lazy anymore. For each example, you'll have to first right-click on the link to the raw ASCII data file to save it to a convenient location on your computer, and then edit the program's INFILE statement to reflect that location. When all is said and done, you'll appreciate even more the lazy learner's way!

Example 20.6

The following program attempts to read the addresses.dat raw data file:

variable-length record data file example

using formatted input:

DATA temp;
    infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat';
      input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
    title 'Temp data set';
RUN;

First, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. Then, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Finally, run  the SAS program, and review the output. You should see that something has clearly gone awry:

Temp data set
Obssubjnamestreet
11024Alice Smith1167 Maryann White
211681201 Benedict A1302 Felicia Ho

If you take a look at the first record in the data file, you can see how things go downhill from the start. In trying to read the street address of subj 1024, SAS encounters an end-of-record marker before getting to the end of the field at column 45. Therefore, SAS proceeds to do what it has been told to do by default, namely to advance to the next line to look for the data field. That's why it reads "1167 Maryann White" as the street address for subj 1024. Having reached the end of the INPUT statement, SAS then proceeds to the next line and begins to read the data for subj 1168. This time, SAS can't complete reading the columns for name or street before encountering the end-of-record marker. Again, SAS advances to the next line and reads "1201 Benedictine A" as the name for subj 1168. SAS tries to read columns 27-45 as the street address for subj 1168, but again encounters the end-of-record marker before getting to column 45. So, SAS advances to the next line to look for the data field, and in so doing reads "1302 Felicia Ho" as the street address. Having reached the end of the INPUT statement, SAS proceeds to the next line, only to discover that there are no more records to read. The DATA step stops, and we are left with a messed up SAS data set! The following two examples help clean up our mess.

Example 20.7

As you now know, by default, SAS goes to the next data line to read more data if SAS has reached the end of a data line and there are still more variables in the INPUT statement that have not been assigned values. The following program uses the INFILE statement's MISSOVER option to tell SAS not to advance to the next data line to read more values to complete observation, but rather to assign missing values to the remaining variables:

DATA temp;
    infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat' MISSOVER;
      input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
    title 'Temp data set';
RUN;

If you haven't already done so, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. And, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Then, run  the SAS program, and review the output. You should see that our resulting data set is much improved:

Temp data set
Objsubjnamestreet
11024Alice Smith 
21167Maryann White756 Washington Ave.
31168Thomas Jones 
41201Benedictine Arnold 
51302Felicia Ho110 Jefferson Blvd.

At least this time, SAS associates the data values that it does read in with the correct variables. It's still not a perfect solution though, is it? In some cases, you might instead want SAS to read in the portions of the data fields that are available, such as 123 Elm St. for the street address for subj 1024, rather than automatically assigning a missing value when an end-of-record marker is encountered. That's where the INFILE statement's PAD option proves useful.

Example 20.8

The following program uses the INFILE statement's PAD option to tell SAS to pad each record in the data file with blanks so that all of the data lines have the same length:

DATA temp;
    infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat' PAD;
      input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
    title 'Temp data set';
RUN;

Again, if you haven't already done so, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. And, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Then, run  the SAS program, and review the output. You should see that we have achieved the most complete and correct data set yet:

Temp data set
Obssubjnamestreet
11024Alice Smith123 Elm St.
21167Maryann White756 Washington Ave.
31168Thomas Jones 
41201Benedictine Arnold245 Ferguson Ave.
51302Felicia Ho110 Jefferson Blvd.

20.3 - Issues with Reading Free-Format Data

20.3 - Issues with Reading Free-Format Data

Just as you might encounter situations in which reading data using column and formatted input is not a straightforward exercise, so is the case with list input. In this section, we'll investigate two such situations. First, we'll address what you need to do when your input data file contains some missing values. Then, we'll investigate using the LENGTH statement as a way to tell SAS to set the length of character values to be something other than the default 8 characters. Let's jump right in again!

Reading Missing Values Using List Input

As mentioned in our review of list input, whenever our data contain missing values, we have to make sure that there is a placeholder in place of the missing values ... a period or some other character. If not, when using list input, the blank field causes the matching of variable names and values to get out of sync. You can imagine a situation in which you have a large data file ... let's say 5,000 records and 100 variables ... with no placeholders in place for the data values that are missing. You certainly wouldn't want to have to enter missing value placeholders into the data file just so you could read the data file into a SAS data set. Fortunately, there are ways around this problem! You've already seen one way in the last section, namely the use of the INFILE statement's MISSOVER option. That option will serve you well when using list input too, providing all of your missing values are at the end of a record. The other way, namely the use of the INFILE statement's DSD option, will serve you well when your missing values are at the beginning or middle of a record.

Example 20.9

The following program uses list input and the INFILE statement's MISSOVER option to read in the number of books five children read each week in a library's summer reading program when some of the values are missing at the end of a record:

DATA reading;
	infile DATALINES MISSOVER;
	input Name $ Week1-Week5;
	DATALINES;
Robin 3 2 5 1 6
Jack 4 4 4 3 4
Tim 3 0 0
Martin 1 0 1 1
Caroline 2 3 4 5 6
RUN;
 
PROC PRINT data = reading;
	title 'Summer Reading Program';
	var Name Week1-Week5;
RUN;

Summer Reading Program
ObsNameWeek1Week2Week3Week4Week5
1Robin32516
2Jack44434
3Tim300..
4Martin1011.
5Caroline23456

First, review the data to note that Tim has no values recorded for the fourth and fifth week and Martin has no values recorded for the fifth week. Also, note that just so you don't have to go through the rigamarole of downloading a data file and editing the INFILE statement, we are back to using the DATALINES statement to read the data instream. But, in order to still be able to take advantage of the INFILE statement's options, we have to tell SAS to expect the data after the DATALINES statement. That's why the:

infile DATALINES ...;

statement.

We need to use the INFILE statement just so we can take advantage of the statement's MISSOVER option. Just as is the case for column and formatted input, the MISSOVER option tells SAS not to advance to the next record when it doesn't find values in the current line for all of the INPUT statement variables. At the end of the current record, values that are expected but not found are set to missing.

One more thing to note is the range of numeric variables:

Week1-Week5

that appears in the INPUT statement. When the variable values in the input raw data file are sequential and separated by a blank (or some other such delimiter), you can specify a range of variables in the INPUT statement, as we did in the above program. This is especially useful if your data contains similar variables. For example, if your data contain the answers to a questionnaire, you might want to call your variables q1, q2, q3, ... . Note that you can also specify a range of variables in the PRINT procedure's VAR statement.

Finally, launch and run  the SAS program, and review the output to convince yourself that the data are read in properly.

It is important to keep in mind that the MISSOVER option works only for missing values that occur at the end of the record. If your data contains missing values at the beginning or middle of a record, then you need instead to take advantage of the INFILE statement's DSD option.

Example 20.10

 The following program uses list input and the INFILE statement's DSD option to read in the results of a survey of five people when some of the values are missing in the middle of a record:

DATA survey;
	infile DATALINES DLM=',' DSD;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert,,A,C,A,D
William,B,C,A,D,A
Linda,C,B,,A,C
Lisa,D,D,D,C,A
Katherine,A,B,C,D,A
RUN;
 
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
ObsNameQ1Q2Q3Q4Q5
1Robert ACAD
2WilliamBCADA
3LindaCB AC
4LisaDDDCA
5KatherinABCDA

First, review the data to note that Robert and Linda each have one missing value in the middle of their record. Also, note that the data values are delimited with commas now rather than blanks. That's why the:

DLM=','

option appears in the INFILE statement. You might recall from Stat 480 that DLM stands for delimiter. Then, you should notice the range of variables:

(Q1-Q5) ($)

that appears in the INPUT statement. We've used the proper syntax for specifying a range of character variables, namely both the variable list and the $ sign are enclosed in parentheses.

Hmmm ... what else? Oh yeah ... the whole point of this example ... the DSD option that appears in the INFILE statement. The DSD option changes how SAS treats delimiters when list input is used. Specifically, the DSD option:

  • sets the default delimiter to a comma
  • treats two consecutive delimiters as a missing value
  • removes quotation marks from values

That first point clearly makes our inclusion of the DLM=',' option superfluous, but that's okay. It's the second point that is the most relevant and the one that makes our program here work. Let's check it out! Launch and run  the SAS program, and review the output to convince yourself that the data are read in properly. Oops ... not quite! The name Katherine contains nine characters. Because SAS defines the length of character variables to be 8 bytes by default, SAS truncates her name and stores it as Katherin without the ending e. We'll learn how to rectify that problem soon, but first one more example of the DSD option.

Example 20.11

The following program illustrates that the DSD option can also be used i) when there is a missing value at the beginning of a record, and ii) when the data are delimited by blanks (in conjunction with the DLM= option):

DATA survey;
	infile DATALINES DLM=' ' DSD;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert  A C A D
William B C A D A
Linda C B  A C
 D D D C A
Katherine A B C D A
RUN;
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
ObsNameQ1Q2Q3Q4Q5
1Robert ACAD
2WilliamBCADA
3LindaCB AC
4 DDDCA
5KatherinABCDA

The only difference between this program and the previous one is that the comma delimiters have been replaced with blanks, and the first value (Lisa) in the fourth record has been deleted so that our data set contains a missing value at the beginning of a record. Because the data are now delimited with blanks we use the DLM = ' ' option. Launch and run  the SAS program, and review the output to convince yourself that the data are read in properly. Oops ... let's go fix that problem with Katherine now.

Modifying the Length of Character Values

As mentioned previously, when you use list input to read raw data, character values are assigned a default length of 8. When we ran the last two programs, that's why Katherine was truncated to Katherin. The LENGTH statement will help us resolve this problem.

Example 20.12

The following program is identical to the previous program, except a LENGTH statement has been added to tell SAS to define the length of the character variable Name to be 9 rather than the default 8:

DATA survey;
	infile DATALINES DLM=' ' DSD;
	length Name $ 9;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert  A C A D
William B C A D A
Linda C B  A C
 D D D C A
Katherine A B C D A
RUN;
 
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
ObsNameQ1Q2Q3Q4Q5
1Robert ACAD
2WilliamBCADA
3LindaCB AC
4 DDDCA
5KatherineABCDA

Note that the LENGTH statement was intentionally placed before the INPUT statement. That's because variable attributes are defined when the variable is first encountered in the DATA step. If we placed the LENGTH statement after the INPUT statement, SAS would therefore by default define the length of Name to be 8 characters, not the desired 9. As you can see by the inclusion of the dollar sign ($), the LENGTH statement also tells SAS that Name should be treated as a character variable.

Now, go ahead and launch and run  the SAS program, and review the output to convince yourself that all of the data are read in properly this time!


20.4 - Modifying List Input

20.4 - Modifying List Input

List input can be made even more versatile by using what is called modified list input. Modified list input entails using either the ampersand (&) modifier or the colon (:) modifier:

  • The ampersand (&) modifier allows you to read character values that contain embedded blanks.
  • The colon (:) modifier allows you to read nonstandard data values and character values that are longer than eight characters, but which have no embedded blanks.

Let's take a look at an example in which modified list input would be useful.

Example 20.13

The following program attempts to use list input to read the populations of the ten most populous cities in the United States into a temporary SAS data set called citypops, but the program fails:

DATA reading;
DATA citypops;
   infile DATALINES FIRSTOBS = 2;
   input city pop2000;
   DATALINES;
City  Yr2000Popn
New York  8,008,278
Los Angeles  3,694,820
Chicago  2,896,016
Houston  1,953,631
Philadelphia  1,517,550
Phoenix  1,321,045
San Antonio  1,144,646
San Diego  1,223,400
Dallas  1,188,580
San Jose  894,943
;
RUN;
 
PROC PRINT data = citypops;
   title 'The citypops data set';
RUN;

The citypops data set
Obscitypop2000
1..
2..
3..
4..
5..
6..
7..
8..
9..
10..

In reviewing the data, the first thing you might notice is that this particular input data file contains a header row:

City  Yr2000Popn

that reports the content of each record. You may often find yourself in a situation in which someone has handed you such a data file, that is, one containing headings in addition to the columns of data. In general, that's a good thing, since then you know for sure what each record contains. It creates a problem though for reading in the data unless you tell SAS to disregard the heading information. That's just what the FIRSTOBS = 2 option in the INFILE statement tells SAS to do. It tells SAS to begin reading data at line 2 instead of the default line 1.

Moving past the header row, you should note the important features of the data. The longest city name is 12 characters. Some of the cities — New York, for example — contain embedded blanks. There are two blank spaces between the city names and their populations. Finally, because the population values contain commas, they are nonstandard values that require an informat during input. Given this list of features, it shouldn't be surprising that the standard list input style used in the INPUT statement fails.

Launch and run  the SAS program, and review the output to convince yourself that SAS encounters a serious problem when attempting to read the data into the citypops data set.

The Ampersand (&) Modifier

Because the ampersand (&) modifier allows us to use list input to read character values containing single embedded blanks, it is the tool that we will want to use to read in the city names.

Example 20.14 

The following program uses list input modified with an ampersand (&) to read in the city and the population values of the ten most populous cities in the United States in the year 2000:

DATA citypops;
	infile DATALINES FIRSTOBS = 2;
	length city $ 12;
	input city & pop2000;
	DATALINES;
City  Yr2000Popn
New York  8008278
Los Angeles  3694820
Chicago  2896016
Houston  1953631
Philadelphia  1517550
Phoenix  1321045
San Antonio  1144646
San Diego  1223400
Dallas  1188580
San Jose  894943
;
RUN;
 
PROC PRINT data = citypops;
	title 'The citypops data set';
	format pop2000 comma10.;
RUN;

The citypops data set
Obscitypop2000
1New York8,008,278
2Los Angeles3,694,820
3Chicago2,896,016
4Houston1,953,631
5Philadelphia1,517,550
6Phoenix1,321,045
7San Antonio1,144,646
8San Diego1,223,400
9Dallas1,188,580
10San Jose894,943

Comparing this program to the previous program you should note four differences:

  1. The LENGTH statement tells SAS, in the compile phase, to define the city variable as a character variable, and to expect the city names to be as long as 12 characters.
  2. The ampersand (&) that follows the city variable in the INPUT statement tells SAS that the city values may contain one or more single embedded blanks. Because the ampersand modifier is used, SAS will read the city value until two or more consecutive blanks are encountered. That is a very important point ... when you use ampersand modified list input, the values that you are reading in must be separated by two or more consecutive blanks. You cannot use any other delimiter to indicate the end of each field.
  3. The commas have been removed from the population values so that SAS can read in the population values using unmodified (standard) list input for the pop2000 variable.
  4. A FORMAT statement has been added to the PRINT procedure just so that the pop2000 values are displayed with commas.

Launch and run  the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are read in properly.

Example 20.15

Rather than using a LENGTH statement to define the type and length of the city variable, we can place a $w. character informat right in the INPUT statement. The only difference between the following program and the previous one is that the LENGTH statement has been removed, and the $12. character informat has been inserted into the INPUT statement immediately following the city variable's ampersand (&) modifier:

DATA citypops;
infile DATALINES FIRSTOBS = 2;
input city & $12. pop2000;
DATALINES;
City  Yr2000Popn
New York  8008278
Los Angeles  3694820
Chicago  2896016
Houston  1953631
Philadelphia  1517550
Phoenix  1321045
San Antonio  1144646
San Diego  1223400
Dallas  1188580
San Jose  894943
;
RUN;
 
PROC PRINT data = citypops;
	title 'The citypops data set';
	format pop2000 comma10.;
RUN;

The citypops data set
Obscitypop2000
1New York8,008,278
2Los Angeles3,694,820
3Chicago2,896,016
4Houston1,953,631
5Philadelphia1,517,550
6Phoenix1,321,045
7San Antonio1,144,646
8San Diego1,223,400
9Dallas1,188,580
10San Jose894,943

Launch and run  the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are again read in properly.

The Colon (:) Modifier

The colon (:) modifier allows us to use list input to read nonstandard data values and character values that are longer than eight characters, but which contain no embedded blanks. The colon (:) indicates that values are read until a blank (or other delimiters) is encountered, and then an informat is applied. If an informat for reading character values is specified, the w value specifies the variable's length, overriding the default length of 8.

Example 20.1

The following program uses the colon (:) modifier to tell SAS to expect commas when reading in the values for the pop2000 variable:

DATA citypops;
	infile DATALINES FIRSTOBS = 2;
	input city & $12. pop2000 : comma.;
	DATALINES;
City  Yr2000Popn
New York  8,008,278
Los Angeles  3,694,820
Chicago  2,896,016
Houston  1,953,631
Philadelphia  1,517,550
Phoenix  1,321,045
San Antonio  1,144,646
San Diego  1,223,400
Dallas  1,188,580
San Jose  894,943
;
RUN;

PROC PRINT data = citypops;
	title 'The citypops data set';
	format pop2000 comma10.;
RUN;

The citypops data set
Obscitypop2000
1New York8,008,278
2Los Angeles3,694,820
3Chicago2,896,016
4Houston1,953,631
5Philadelphia1,517,550
6Phoenix1,321,045
7San Antonio1,144,646
8San Diego1,223,400
9Dallas1,188,580
10San Jose894,943

Comparing this program to the previous program you should note just two differences:

  1. The commas have been added back into the population values so that we can see how to use the colon (:) modifier to read in nonstandard data values while still using list input.
  2. The colon (:) and comma. informat that follows the pop2000 variable in the INPUT statement tells SAS to expect population values to contain nonstandard characters — commas, in this particular instance. As illustrated here, we need not specify a w value when using the COMMAw.d informat here. That's because list input just reads each value until a blank is detected. (This differs from using a numeric informat with formatted input, in which we must specify a w value in order to tell SAS how many columns to read.)

Launch and run  the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are again read in properly.

Comparing Formatted Input and Modified List Input

It is important to keep in mind that informats work differently in modified list input than they do in formatted input. So, let's emphasize the point! With formatted input, the informat determines both the length of character variables and the number of columns that are read. The same number of columns are read from each record. For example, the following INPUT statement using formatted input:

input @1 City $12. @15 Pop2000 comma10.;

uses the $12. character informat to tell SAS to set the length of the city variable to 12 as well as to read columns 1 to 12 when reading in these data values:

CityPop2000
New York8,008,278
Los Angeles3,694,820
Chicago2,896,016
Huston1,953,631
Philadelphia1,517,550

The informat in the modified list input, on the other hand, determines only the length of the modified variable, not the number of columns that are read. Here:

input city & $12. pop2000 : comma.;

the raw data values are read until two consecutive blanks are encountered when reading in these data values suitable for list input:

CityPop2000
New York8,008,278
Los Angeles3,694,820
Chicago2,896,016
Huston1,953,631
Philadelphia1,517,550

20.5 - Mixing Input Styles

20.5 - Mixing Input Styles

Of course, the first thing we always do when trying to read data from a raw data file into a SAS data set is to review the data to determine whether we need to use column, formatted, or list input to read in the data values. There are some situations, however, in which just one input method doesn't do the trick. In those cases, we'll want to consider mixing input styles, that is, using more than one input method simultaneously. Let's take a look at an example!

Example 20.17

The following program illustrates using column input, list input, and formatted input simultaneously to read in data concerning five U.S. national parks:

DATA nationalparks;
    	input ParkName $ 1-22 State $ Year @40 Acreage comma9.;
    	DATALINES;
    Yellowstone           ID/MT/WY 1872    4,065,493
    Everglades            FL 1934          1,398,800
    Yosemite              CA 1864            760,917
    Great Smoky Mountains NC/TN 1926         520,269
    Wolf Trap Farm        VA 1966                130;
RUN;
PROC PRINT data = nationalparks;
    format acreage comma9.;
RUN;

The SAS System
ObsParkNameStateYearAcreage
1YellowstoneID/MT/WY18724,065,493
2EvergladesFL19341,398,800
3YosemiteCA1864760,917
4Great Smoky MountainsNC/TN1926520,269
5Wolf Trap FarmVA1966130

Reviewing the data, you can see that:

  • Column input is an appropriate method for the first field (ParkName) because the values can be read as standard character values and are arranged in a neatly defined column.
  • The next two fields are candidates for list input, as they are separated by a single blank while the first (State) contains standard character values and the second (Year) contains standard numeric values.
  • The values in the last field (Acreage) are arranged in a neatly defined column, but the values require an informat. Therefore, formatted input is an appropriate method.

As you can see by the INPUT statement, mixing the input styles is easily achieved by simply using the different methods within the same INPUT statement. Launch and run  the SAS program, and review the output to convince yourself that SAS read in the data properly.


20.6 - Summary

20.6 - Summary

In this lesson, we extended our knowledge of how to read data from various types of data files into SAS data sets.

The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work and can use them in your own SAS programming.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility