Lesson 20: More on Importing Data -- Part I
Lesson 20: More on Importing Data -- Part IOverview
In STAT 480, we learned how to read only the most basic data files into a SAS data set. In this lesson (and the next), we'll extend our knowledge in this area by learning how to read just about any data file into SAS — no matter how messy or unstructured the input data file is. In most cases, the data files will be raw ASCII data files that are obtained from exporting data from some other PC software.
Objectives
Upon completing this lesson, you should be able to do the following:
- read raw data separated by spaces into a SAS data set (that is, use list input)
- read raw data arranged in columns into a SAS data set (that is, use column input)
- read raw data not in standard format into a SAS data set (that is, use formatted input)
- mix list, column, and formatted input styles to read raw data into a SAS data set
- determine when list input, column input, formatted input, or some combination of the three styles should be used to input a raw data file
- understand that the lengths of numeric variables are set to 8 by default and therefore do not necessarily coincide with the widths of the numeric informats used in an INPUT statement
- state the difference between fixed-length record data files and variable-length record data files
- determine when it is appropriate, and how, to use the INFILE statement's PAD option
- decide when it is appropriate, and how, to use the INFILE statement's MISSOVER option
- determine when it is appropriate, and how, to use the INFILE statement's DLM= option
- decide when it is appropriate, and how, to use the INFILE statement's DSD option
- determine when it is appropriate, and how, to use the INFILE statement's FIRSTOBS= option
- state how to read missing values when using list input
- determine when it is appropriate, and how, to specify a range of numeric or character variables in the INPUT statement
- utilize the LENGTH statement to modify the length of a character or numeric variable when appropriate
- apply the ampersand (&) modifier with list input to read character values that contain embedded blanks
- insert the colon (:) modifier with list input to read nonstandard data values and character values that are longer than eight characters, but which have no embedded blanks
- explain why with formatted input, the informat determines both the length of character variables and the number of columns that are read
- explain why the informat in the modified list input determines only the length of the modified variable, not the number of columns that are read
20.1 - Three Styles of Input
20.1 - Three Styles of InputAs you know from your work in STAT 480, the form of your DATA step's INPUT statement depends on how the data values that you are trying to read in are arranged. If all of the data values are standard numeric or character values that are arranged in neatly defined columns as they are here:
Location | Direction | Point | # | # |
---|---|---|---|---|
Dinosaur | NM | West | 2 | 6 |
Ellis Island | NM | East | 1 | 0 |
Everglades | NP | East | 5 | 2 |
Grand Canyon | NP | West | 5 | 3 |
Great Smoky Mountains | NP | East | 3 | 10 |
Hawaii Volcanoes | NP | West | 2 | 2 |
Lava Beds | NM | West | 1 | 1 |
Statue of Liberty | NM | East | 1 | 0 |
Theodore Roosevelt | NP | West | 2 | 2 |
Yellowstone | NP | West | 9 | 11 |
Yosemite | NP | West | 2 | 13 |
then the data values can be read in using column input. If the data values are arranged in neatly defined columns, but contain some nonstandard characters, such as the dollar sign ($) here:
Book Title | Author | Price | # |
---|---|---|---|
The Kissing Hand | Penn, Audrey | $9.95 | 32 |
Voice in the Park | Browne, Anthony | $7.99 | 32 |
Anno's Journey | Anno, Mitsumasa | $7.99 | 48 |
An Egg is Quiet | Aston, Sianna Hutts | $11.55 | 36 |
Flotsam | Wiesner, David | $11.56 | 40 |
The Snowy Day | Keats, Ezra Jack | $6.99 | 40 |
Corduroy | Freeman, Don | $11.55 | 28 |
Snowflake Bentley | Martin, Jacqueline | $10.88 | 32 |
Knuffle Bunny | Williems, Mo | $10.87 | 40 |
Bear Snores on | Wilson, karma | $10.72 | 32 |
Rattletrap Car | Root, Phyllis | $7.99 | 40 |
oliver Finds His Way | Root, Phyllis | $11.24 | 40 |
then the data values must be read in using formatted input. And, if the data values are not arranged in neatly defined columns, but are separated by at least one space and contain no special characters, as they are here:
Barnes 21 4.9000 4.620 14.700 43.000 92.000 32.000 34.000 Williams 21 5.600 4730 15.000 44.000 94.000 32.000 34.000 Jackson 21 6.700 4.650 14.000 41.000 88.000 30.000 34.000 Teel 21 11.100 4.780 13.700 39.000 82.000 29.00 35.000 Smith 21 9.200 4.6013.300 39.000 83.000 29.000 34.000 Washington 21 10.100 4.370 13.700 40.000 91.000 31.000 34.000 Stewart 21 4.600 3.930 13.200 37.00 95.00 34.000 36.000 Taylor 21 6.800 4.620 13.70 40.000 86.000 30.000 35.000 Lindsay 21 6.800 4.620 13.700 40.000 86.000 30.000 35.000 Czekaj 21 51.00 4.210 13.100 39.00 91.000 31.000 34.000
then the data values must be read in using list input.
Just in case you haven't had much of a chance to practice these three styles of input since having taken Stat 480, we'll review them briefly in this section. Then, in the remaining part of this lesson (and the next!), we'll learn how to extend the basic styles of input so that we can read just about any raw data file we want into a SAS data set.
One administrative comment first! We'll look at a number of examples in this lesson and throughout the remainder of the course. For the sake of simplicity, whenever possible, we'll read raw data instream using a DATALINES statement (or the equivalent CARDS statement) rather than from an external raw data file using an INFILE statement. This is the lazy learner's way as it will save us from having to download the data file, save the file on our computers, and edit the INFILE statement to reflect the location of the file. Be aware though that, in practice, nine times out of ten, you'll read data from an external raw data file using an INFILE statement.
Column Input
Because column input allows you to read variable values that occupy the same columns within each record, you'll often hear it described as "reading fixed-field data." As you know, to use column input, we simply list the variable names in the INPUT statement, immediately following each variable name with its corresponding column positions in each of the data lines.
Example 20.1
The following program uses column input to read in the values of one character variable (name) and four numeric variables (subj, gender, height, and weight) into a temporary SAS data set called temp:
DATA temp;
input subj 1-4 name $ 6-23 gender 25 height 27-28 weight 30-32;
CARDS;
1024 Alice Smith 1 65 125
1167 Maryann White 1 68 140
1168 Thomas Jones 2 68 190
1201 Benedictine Arnold 2 68 190
1302 Felicia Ho 1 63 115
;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | name | gender | height | weight |
---|---|---|---|---|---|
1 | 1024 | Alice Smith | 1 | 65 | 125 |
2 | 1167 | Maryann White | 1 | 68 | 140 |
3 | 1168 | Thomas Jones | 2 | 68 | 190 |
4 | 1201 | Benedictine Arnold | 2 | 68 | 190 |
5 | 1302 | Felicia Ho | 1 | 63 | 115 |
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.
Here are the important points to recall about column input:
- When using column input, you are not required to indicate missing values with a placeholder, such as a period. That is, missing values can be left blank.
- Column input uses the columns specified to determine the length of character variables, thereby allowing character values to exceed the default 8 characters and to have embedded spaces.
- Column input allows fields to be skipped altogether or to be read in any order.
- Column input allows only part of a value to be read and allows values to be re-read.
- Spaces are not required between the data values.
Now, let's investigate how column input can be modified to accommodate reading in values that contain special characters.
Formatted Input
The fundamental difference between column input and formatted input is that column input is only appropriate for reading standard numeric data, while formatted input allows us to read both standard and nonstandard numeric data. That is, formatted input combines the features of column input with the ability to read nonstandard data values.
Example 20.2
The following program uses formatted input to read two standard numeric variables (subj and height), two standard character variables (f_name and l_name), and two nonstandard numeric variables (wt_date and calorie) into a temporary SAS data set called temp:
DATA temp;
input @1 subj 4.
@6 f_name $11.
@18 l_name $6.
+3 height 2.
+5 wt_date mmddyy8.
+1 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC PRINT data = temp;
title 'Output dataset: TEMP';
id subj;
RUN;
subj | f_name | l_name | height | wt_date | calorie |
---|---|---|---|---|---|
1024 | Alice | Smith | 65 | 12/01/95 | 2,036 |
1167 | Maryann | White | 68 | 12/01/95 | 1,800 |
1168 | Thomas | Jones | . | 12/02/95 | 2,302 |
1201 | Benedictine | Arnold | 68 | 11/30/95 | 2,432 |
1302 | Felicia | Ho | 63 | 01/01/96 | 1,972 |
Recall that the @n absolute pointer control tells SAS to move the input pointer to a specific column number n. For example, the INPUT statement tells SAS to move the input pointer to column 18 before starting to read the values for the l_name variable. Recall that the +n relative pointer control tells SAS to move the input pointer forward n columns to a column number that is relative to the current position. For example, the INPUT statement tells SAS, after reading the values for the l_name variable, to move the input pointer 3 positions to the right before starting to read the values for the height variable.
The numeric informat 4. that follows subj in the INPUT statement tells SAS that the values for the numeric variable subj occupy four columns. The character informat $11. that follows f_name in the INPUT statement tells SAS that the values for the character variable f_name occupy eleven columns. And, the special informats mmddyy8. and comma5. that follow wt_date and calorie, respectively, in the INPUT statement tell SAS what to expect when reading in the values for the numeric variables wt_date and calorie.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly. Oh ... you might want to also recall that the PRINT procedure displays the values for wt_date and calorie in a meaningful way only because of the presence of the FORMAT statement in the DATA step. If this doesn't sound familiar, you might want to remove the FORMAT statement and re-run the program to see the less than helpful values that are displayed for the wt_date variable.
List Input
If the data values that you are trying to read into a SAS data set are not arranged in neatly defined columns but are separated by at least one space and contain no special characters, then you must use list input. To use list input, you simply place your desired variable names in your INPUT statement in the same order that your data fields appear in your input data file.
Example 20.3
The following program uses list input to read in the subject number, name, gender, weight, and height of five individuals into a temporary SAS data set called temp:
DATA temp;
input subj name $ gender height weight;
CARDS;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine . 68 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT data=temp NOOBS;
title 'Output dataset: TEMP';
RUN;
subj | name | gender | height | weight |
---|---|---|---|---|
1024 | Alice | 1 | 65 | 125 |
1167 | Maryann | 1 | 68 | 140 |
1168 | Thomas | 2 | 68 | 190 |
1201 | Benedict | . | 68 | 190 |
1302 | Felicia | 1 | 63 | 115 |
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data are read in properly.
Although list input might be the easiest input style to use, the ease comes with a bit of a price. In its default form, list input does have several limitations, including:
- Fields must be separated by at least one blank (or other delimiter).
- Fields must be read in order from left to right.
- You cannot skip or re-read fields.
- Missing values must be represented by a placeholder such as a period. (A blank field causes the matching of variable names and values to get out of sync.)
- Character values can't contain embedded blanks (or other delimiters).
- Although the width of a field can be greater than eight columns, both character and numeric variables have a default length of 8 bytes. Character values that are longer than eight characters are truncated.
- Data must be in standard character or numeric format.
20.2 - Issues with Reading Fixed-Field Data
20.2 - Issues with Reading Fixed-Field DataFor the most part, using column and formatted input to read in data values that are arranged in neatly defined columns is straightforward. There are though — as always — a few sticking points that we'll address in this section. First, we'll learn not to be surprised when the lengths of our numeric variables do not coincide with the widths of the numeric informats we use in our INPUT statement. Then, we'll investigate what to do when trying to read in fixed-field data from a raw data file that contains records of different lengths.
Lengths of Variables
You might recall that the descriptor portion of a data set contains information about the attributes of each variable in your data set. The length of each variable in your data sets is one such attribute. When you use informats to read in some of your data values, it pays to know how SAS defines the lengths of your variables when it processes your DATA step. Let's make our discussion of this concrete by taking a look at an example.
Example 20.4
The following program is identical to the program in Example 20.2, except the PRINT procedure has been replaced with the CONTENTS procedure so that we can explore how SAS defines the lengths of variables:
DATA temp;
input @1 subj 4.
@6 f_name $11.
@18 l_name $6.
+3 height 2.
+5 wt_date mmddyy8.
+1 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC CONTENTS data = temp;
title 'Contents of TEMP Data Set';
RUN;
Contents of TEMP Data Set
Data Set Name | WORK.TEMP | Observations | 5 |
---|---|---|---|
Member Type | DATA | Variables | 6 |
Engine | V9 | Indexes | 0 |
Created | 2023/09/12 16:10:42 | Observation Length | 56 |
Last Modified | 2023/09/12 16:10:42 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | shift-jis Japanese (SJIS) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1167 |
Obs in First Data Page | 5 |
Number of Data Set Repairs | 0 |
ExtendObsCounter | YES |
Filename | C:\Users\klc137\AppData\Local\Temp\SAS Temporary Files\_TD1668_E1-055551_\temp.sas7bdat |
Release Created | 9.0401M5 |
Host Created | X64_10PRO |
Owner Name | PSU\klc137 |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Format |
6 | calorie | Num | 8 | COMMA5. |
2 | f_name | Char | 11 | |
4 | height | Num | 8 | |
3 | l_name | Char | 6 | |
1 | subj | Num | 8 | |
5 | wt_date | Num | 8 | MMDDYY8. |
Launch and run the SAS program, and when reviewing the output of the CONTENTS procedure, direct your attention to this portion:
# | Variable | Type | Len | Format |
---|---|---|---|---|
6 | calorie | num | 8 | COMMA5. |
2 | f_name | Char | 11 | |
4 | height | Num | 8 | |
3 | l_name | Char | 6 | |
1 | subj | Num | 8 | |
5 | wt_date | Nu | 8 | MMDDYY8. |
Let's see what we can learn from this ... SAS defined the length of the character variables — f_name and l_name — as 11 and 6, respectively. Hmm ... is it just a coincidence that the informats we used in the INPUT statement for these two variables were $11.? and $6., respectively? Of course not! In general, during the compile phase, SAS defines the lengths of the character variables in the program data vector to be the same as the field width w specified by the $w. character informat.
Can the same generalization be made for the numeric variables, such as subj, height, wt_date, and calorie? Apparently not! For example, we used the numeric informat of 4. to read in the subj variable, and yet SAS assigned the subj variable a length of 8 bytes. Similarly, we used the comma5. informat to read in the calorie variable, and yet SAS assigned the calorie variable a length of, not 5, but 8 bytes. In general, by default, SAS stores a numeric value— no matter how many digits the value contains — as a floating-point number in 8 bytes of storage. That is, the length of a stored numeric variable is not affected by an informat's width nor by any other column specifications in an INPUT statement. Unfortunately, that doesn't mean you can be lazy and use a default width of 8 for all of your numeric informats. If you do, you'll quickly see that your data values are not read properly.
Example 20.5
The following program is similar to the program in the previous example, except all of the input pointer controls have been changed to @n absolute pointer controls and the numeric informat for the height variable has been changed from 2. to 8.:
DATA temp;
input @1 subj 4.
@6 f_name $11.
@18 l_name $6.
@30 height 8.
@34 wt_date mmddyy8.
@43 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC PRINT data = temp;
title 'The TEMP Data Set';
RUN;
Obs | subj | f_name | l_name | height | wt_date | calorie |
---|---|---|---|---|---|---|
1 | 1024 | Alice | Smith | . | 12/01/95 | 2,036 |
2 | 1167 | Maryann | White | . | 12/01/95 | 1,800 |
3 | 1168 | Thomas | Jones | . | 12/02/95 | 2,302 |
4 | 1201 | Benedictine | Arnold | . | 11/30/95 | 2,432 |
5 | 1302 | Felicia | Ho | . | 01/01/96 | 1,972 |
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that SAS had trouble reading the data values for the height variable. If you take a look at the input data, you can see why. If you start in column 30, where the height values begin, and count eight columns to the right, you should see that SAS encounters two blank spaces — the one between the height and weight values and the one between the weight and date values. Because a blank space is a character value, and the height variable is defined as numeric, SAS gets confused, and displays an Invalid data message in the log window:
NOTE: Invalid data for height in line 299 30-37.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1024 Alice Smith 1 65 125 12/1/95 2,036
subj=1024 f_name=Alice l_name=Smith height=. wt_date=12/01/95 calorie=2,036 _ERROR_=1 _N_=1
NOTE: Invalid data for height in line 300 30-37.
1167 Maryann White 1 68 140 12/01/95 1,800
subj=1167 f_name=Maryann l_name=White height=. wt_date=12/01/95 calorie=1,800 _ERROR_=1 _N_=2
NOTE: Invalid data for height in line 301 30-37.
1168 Thomas Jones 2 190 12/2/95 2,302
subj=1168 f_name=Thomas l_name=Jones height=. wt_date=12/02/95 calorie=2,302 _ERROR_=1 _N_=3
NOTE: Invalid data for height in line 302 30-37.
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
subj=1201 f_name=Benedictine l_name=Arnold height=. wt_date=11/30/95 calorie=2,432 _ERROR_=1 _N_=4
NOTE: Invalid data for height in line 303 30-37.
1302 Felicia Ho 1 63 115 1/1/96 1,972
subj=1302 f_name=Felicia l_name=Ho height=. wt_date=01/01/96 calorie=1,972 _ERROR_=1 _N_=5
NOTE: The data set WORK.TEMP has 5 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
and assigns missing values to the height variable. Because we used @n absolute pointer controls to tell SAS where to begin reading the wt_date and calorie values, SAS has no trouble reading in the data for those variables.
In a nutshell, the moral of the story of the two previous examples is that you need to make sure that you define the widths of your character and numeric informats to be the number of columns the data values occupy in the input data, not the length of their resulting variables.
Reading Variable-Length Records
When you are using column or formatted input, you might run into trouble when the records in your input data file are of different lengths. Data files that have an end-of-record marker after a fixed, predetermined number of columns are called fixed-length record data files. This data file:
is a fixed-length record data file, because each record ends in the same column. Reading this data file using column input is a straightforward exercise.
Data files that have an end-of-record marker after the last field in each record are called variable-length record data files. This data file:
is a variable-length record data file, because the length of the records varies. Reading this data file using column input requires using the INFILE statement's PAD option or MISSOVER option to tell SAS how to behave when it reaches an end-of-record marker before all of the variables in a record have been read. Let's investigate this more thoroughly.
First, an important administrative comment! So far, we've been able to get away with illustrating all of the above SAS features by reading the data instream using a DATALINES statement. Since the examples that follow concern options that are available for the INFILE statement, we can't be lazy anymore. For each example, you'll have to first right-click on the link to the raw ASCII data file to save it to a convenient location on your computer, and then edit the program's INFILE statement to reflect that location. When all is said and done, you'll appreciate even more the lazy learner's way!
Example 20.6
The following program attempts to read the addresses.dat raw data file:
using formatted input:
DATA temp;
infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat';
input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
title 'Temp data set';
RUN;
First, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. Then, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Finally, run the SAS program, and review the output. You should see that something has clearly gone awry:
Obs | subj | name | street |
---|---|---|---|
1 | 1024 | Alice Smith | 1167 Maryann White |
2 | 1168 | 1201 Benedict A | 1302 Felicia Ho |
If you take a look at the first record in the data file, you can see how things go downhill from the start. In trying to read the street address of subj 1024, SAS encounters an end-of-record marker before getting to the end of the field at column 45. Therefore, SAS proceeds to do what it has been told to do by default, namely to advance to the next line to look for the data field. That's why it reads "1167 Maryann White" as the street address for subj 1024. Having reached the end of the INPUT statement, SAS then proceeds to the next line and begins to read the data for subj 1168. This time, SAS can't complete reading the columns for name or street before encountering the end-of-record marker. Again, SAS advances to the next line and reads "1201 Benedictine A" as the name for subj 1168. SAS tries to read columns 27-45 as the street address for subj 1168, but again encounters the end-of-record marker before getting to column 45. So, SAS advances to the next line to look for the data field, and in so doing reads "1302 Felicia Ho" as the street address. Having reached the end of the INPUT statement, SAS proceeds to the next line, only to discover that there are no more records to read. The DATA step stops, and we are left with a messed up SAS data set! The following two examples help clean up our mess.
Example 20.7
As you now know, by default, SAS goes to the next data line to read more data if SAS has reached the end of a data line and there are still more variables in the INPUT statement that have not been assigned values. The following program uses the INFILE statement's MISSOVER option to tell SAS not to advance to the next data line to read more values to complete observation, but rather to assign missing values to the remaining variables:
DATA temp;
infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat' MISSOVER;
input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
title 'Temp data set';
RUN;
If you haven't already done so, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. And, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Then, run the SAS program, and review the output. You should see that our resulting data set is much improved:
Obj | subj | name | street |
---|---|---|---|
1 | 1024 | Alice Smith | |
2 | 1167 | Maryann White | 756 Washington Ave. |
3 | 1168 | Thomas Jones | |
4 | 1201 | Benedictine Arnold | |
5 | 1302 | Felicia Ho | 110 Jefferson Blvd. |
At least this time, SAS associates the data values that it does read in with the correct variables. It's still not a perfect solution though, is it? In some cases, you might instead want SAS to read in the portions of the data fields that are available, such as 123 Elm St. for the street address for subj 1024, rather than automatically assigning a missing value when an end-of-record marker is encountered. That's where the INFILE statement's PAD option proves useful.
Example 20.8
The following program uses the INFILE statement's PAD option to tell SAS to pad each record in the data file with blanks so that all of the data lines have the same length:
DATA temp;
infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat' PAD;
input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
title 'Temp data set';
RUN;
Again, if you haven't already done so, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. And, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Then, run the SAS program, and review the output. You should see that we have achieved the most complete and correct data set yet:
Obs | subj | name | street |
---|---|---|---|
1 | 1024 | Alice Smith | 123 Elm St. |
2 | 1167 | Maryann White | 756 Washington Ave. |
3 | 1168 | Thomas Jones | |
4 | 1201 | Benedictine Arnold | 245 Ferguson Ave. |
5 | 1302 | Felicia Ho | 110 Jefferson Blvd. |
20.3 - Issues with Reading Free-Format Data
20.3 - Issues with Reading Free-Format DataJust as you might encounter situations in which reading data using column and formatted input is not a straightforward exercise, so is the case with list input. In this section, we'll investigate two such situations. First, we'll address what you need to do when your input data file contains some missing values. Then, we'll investigate using the LENGTH statement as a way to tell SAS to set the length of character values to be something other than the default 8 characters. Let's jump right in again!
Reading Missing Values Using List Input
As mentioned in our review of list input, whenever our data contain missing values, we have to make sure that there is a placeholder in place of the missing values ... a period or some other character. If not, when using list input, the blank field causes the matching of variable names and values to get out of sync. You can imagine a situation in which you have a large data file ... let's say 5,000 records and 100 variables ... with no placeholders in place for the data values that are missing. You certainly wouldn't want to have to enter missing value placeholders into the data file just so you could read the data file into a SAS data set. Fortunately, there are ways around this problem! You've already seen one way in the last section, namely the use of the INFILE statement's MISSOVER option. That option will serve you well when using list input too, providing all of your missing values are at the end of a record. The other way, namely the use of the INFILE statement's DSD option, will serve you well when your missing values are at the beginning or middle of a record.
Example 20.9
The following program uses list input and the INFILE statement's MISSOVER option to read in the number of books five children read each week in a library's summer reading program when some of the values are missing at the end of a record:
DATA reading;
infile DATALINES MISSOVER;
input Name $ Week1-Week5;
DATALINES;
Robin 3 2 5 1 6
Jack 4 4 4 3 4
Tim 3 0 0
Martin 1 0 1 1
Caroline 2 3 4 5 6
RUN;
PROC PRINT data = reading;
title 'Summer Reading Program';
var Name Week1-Week5;
RUN;
Obs | Name | Week1 | Week2 | Week3 | Week4 | Week5 |
---|---|---|---|---|---|---|
1 | Robin | 3 | 2 | 5 | 1 | 6 |
2 | Jack | 4 | 4 | 4 | 3 | 4 |
3 | Tim | 3 | 0 | 0 | . | . |
4 | Martin | 1 | 0 | 1 | 1 | . |
5 | Caroline | 2 | 3 | 4 | 5 | 6 |
First, review the data to note that Tim has no values recorded for the fourth and fifth week and Martin has no values recorded for the fifth week. Also, note that just so you don't have to go through the rigamarole of downloading a data file and editing the INFILE statement, we are back to using the DATALINES statement to read the data instream. But, in order to still be able to take advantage of the INFILE statement's options, we have to tell SAS to expect the data after the DATALINES statement. That's why the:
infile DATALINES ...;
statement.
We need to use the INFILE statement just so we can take advantage of the statement's MISSOVER option. Just as is the case for column and formatted input, the MISSOVER option tells SAS not to advance to the next record when it doesn't find values in the current line for all of the INPUT statement variables. At the end of the current record, values that are expected but not found are set to missing.
One more thing to note is the range of numeric variables:
Week1-Week5
that appears in the INPUT statement. When the variable values in the input raw data file are sequential and separated by a blank (or some other such delimiter), you can specify a range of variables in the INPUT statement, as we did in the above program. This is especially useful if your data contains similar variables. For example, if your data contain the answers to a questionnaire, you might want to call your variables q1, q2, q3, ... . Note that you can also specify a range of variables in the PRINT procedure's VAR statement.
Finally, launch and run the SAS program, and review the output to convince yourself that the data are read in properly.
It is important to keep in mind that the MISSOVER option works only for missing values that occur at the end of the record. If your data contains missing values at the beginning or middle of a record, then you need instead to take advantage of the INFILE statement's DSD option.
Example 20.10
The following program uses list input and the INFILE statement's DSD option to read in the results of a survey of five people when some of the values are missing in the middle of a record:
DATA survey;
infile DATALINES DLM=',' DSD;
input Name $ (Q1-Q5) ($);
DATALINES;
Robert,,A,C,A,D
William,B,C,A,D,A
Linda,C,B,,A,C
Lisa,D,D,D,C,A
Katherine,A,B,C,D,A
RUN;
PROC PRINT data = survey;
title 'Survey Results';
var Name Q1-Q5;
RUN;
Obs | Name | Q1 | Q2 | Q3 | Q4 | Q5 |
---|---|---|---|---|---|---|
1 | Robert | A | C | A | D | |
2 | William | B | C | A | D | A |
3 | Linda | C | B | A | C | |
4 | Lisa | D | D | D | C | A |
5 | Katherin | A | B | C | D | A |
First, review the data to note that Robert and Linda each have one missing value in the middle of their record. Also, note that the data values are delimited with commas now rather than blanks. That's why the:
DLM=','
option appears in the INFILE statement. You might recall from Stat 480 that DLM stands for delimiter. Then, you should notice the range of variables:
(Q1-Q5) ($)
that appears in the INPUT statement. We've used the proper syntax for specifying a range of character variables, namely both the variable list and the $ sign are enclosed in parentheses.
Hmmm ... what else? Oh yeah ... the whole point of this example ... the DSD option that appears in the INFILE statement. The DSD option changes how SAS treats delimiters when list input is used. Specifically, the DSD option:
- sets the default delimiter to a comma
- treats two consecutive delimiters as a missing value
- removes quotation marks from values
That first point clearly makes our inclusion of the DLM=',' option superfluous, but that's okay. It's the second point that is the most relevant and the one that makes our program here work. Let's check it out! Launch and run the SAS program, and review the output to convince yourself that the data are read in properly. Oops ... not quite! The name Katherine contains nine characters. Because SAS defines the length of character variables to be 8 bytes by default, SAS truncates her name and stores it as Katherin without the ending e. We'll learn how to rectify that problem soon, but first one more example of the DSD option.
Example 20.11
The following program illustrates that the DSD option can also be used i) when there is a missing value at the beginning of a record, and ii) when the data are delimited by blanks (in conjunction with the DLM= option):
DATA survey;
infile DATALINES DLM=' ' DSD;
input Name $ (Q1-Q5) ($);
DATALINES;
Robert A C A D
William B C A D A
Linda C B A C
D D D C A
Katherine A B C D A
RUN;
PROC PRINT data = survey;
title 'Survey Results';
var Name Q1-Q5;
RUN;
Obs | Name | Q1 | Q2 | Q3 | Q4 | Q5 |
---|---|---|---|---|---|---|
1 | Robert | A | C | A | D | |
2 | William | B | C | A | D | A |
3 | Linda | C | B | A | C | |
4 | D | D | D | C | A | |
5 | Katherin | A | B | C | D | A |
The only difference between this program and the previous one is that the comma delimiters have been replaced with blanks, and the first value (Lisa) in the fourth record has been deleted so that our data set contains a missing value at the beginning of a record. Because the data are now delimited with blanks we use the DLM = ' ' option. Launch and run the SAS program, and review the output to convince yourself that the data are read in properly. Oops ... let's go fix that problem with Katherine now.
Modifying the Length of Character Values
As mentioned previously, when you use list input to read raw data, character values are assigned a default length of 8. When we ran the last two programs, that's why Katherine was truncated to Katherin. The LENGTH statement will help us resolve this problem.
Example 20.12
The following program is identical to the previous program, except a LENGTH statement has been added to tell SAS to define the length of the character variable Name to be 9 rather than the default 8:
DATA survey;
infile DATALINES DLM=' ' DSD;
length Name $ 9;
input Name $ (Q1-Q5) ($);
DATALINES;
Robert A C A D
William B C A D A
Linda C B A C
D D D C A
Katherine A B C D A
RUN;
PROC PRINT data = survey;
title 'Survey Results';
var Name Q1-Q5;
RUN;
Obs | Name | Q1 | Q2 | Q3 | Q4 | Q5 |
---|---|---|---|---|---|---|
1 | Robert | A | C | A | D | |
2 | William | B | C | A | D | A |
3 | Linda | C | B | A | C | |
4 | D | D | D | C | A | |
5 | Katherine | A | B | C | D | A |
Note that the LENGTH statement was intentionally placed before the INPUT statement. That's because variable attributes are defined when the variable is first encountered in the DATA step. If we placed the LENGTH statement after the INPUT statement, SAS would therefore by default define the length of Name to be 8 characters, not the desired 9. As you can see by the inclusion of the dollar sign ($), the LENGTH statement also tells SAS that Name should be treated as a character variable.
Now, go ahead and launch and run the SAS program, and review the output to convince yourself that all of the data are read in properly this time!
20.4 - Modifying List Input
20.4 - Modifying List InputList input can be made even more versatile by using what is called modified list input. Modified list input entails using either the ampersand (&) modifier or the colon (:) modifier:
- The ampersand (&) modifier allows you to read character values that contain embedded blanks.
- The colon (:) modifier allows you to read nonstandard data values and character values that are longer than eight characters, but which have no embedded blanks.
Let's take a look at an example in which modified list input would be useful.
Example 20.13
The following program attempts to use list input to read the populations of the ten most populous cities in the United States into a temporary SAS data set called citypops, but the program fails:
DATA reading;
DATA citypops;
infile DATALINES FIRSTOBS = 2;
input city pop2000;
DATALINES;
City Yr2000Popn
New York 8,008,278
Los Angeles 3,694,820
Chicago 2,896,016
Houston 1,953,631
Philadelphia 1,517,550
Phoenix 1,321,045
San Antonio 1,144,646
San Diego 1,223,400
Dallas 1,188,580
San Jose 894,943
;
RUN;
PROC PRINT data = citypops;
title 'The citypops data set';
RUN;
Obs | city | pop2000 |
---|---|---|
1 | . | . |
2 | . | . |
3 | . | . |
4 | . | . |
5 | . | . |
6 | . | . |
7 | . | . |
8 | . | . |
9 | . | . |
10 | . | . |
In reviewing the data, the first thing you might notice is that this particular input data file contains a header row:
City Yr2000Popn
that reports the content of each record. You may often find yourself in a situation in which someone has handed you such a data file, that is, one containing headings in addition to the columns of data. In general, that's a good thing, since then you know for sure what each record contains. It creates a problem though for reading in the data unless you tell SAS to disregard the heading information. That's just what the FIRSTOBS = 2 option in the INFILE statement tells SAS to do. It tells SAS to begin reading data at line 2 instead of the default line 1.
Moving past the header row, you should note the important features of the data. The longest city name is 12 characters. Some of the cities — New York, for example — contain embedded blanks. There are two blank spaces between the city names and their populations. Finally, because the population values contain commas, they are nonstandard values that require an informat during input. Given this list of features, it shouldn't be surprising that the standard list input style used in the INPUT statement fails.
Launch and run the SAS program, and review the output to convince yourself that SAS encounters a serious problem when attempting to read the data into the citypops data set.
The Ampersand (&) Modifier
Because the ampersand (&) modifier allows us to use list input to read character values containing single embedded blanks, it is the tool that we will want to use to read in the city names.
Example 20.14
The following program uses list input modified with an ampersand (&) to read in the city and the population values of the ten most populous cities in the United States in the year 2000:
DATA citypops;
infile DATALINES FIRSTOBS = 2;
length city $ 12;
input city & pop2000;
DATALINES;
City Yr2000Popn
New York 8008278
Los Angeles 3694820
Chicago 2896016
Houston 1953631
Philadelphia 1517550
Phoenix 1321045
San Antonio 1144646
San Diego 1223400
Dallas 1188580
San Jose 894943
;
RUN;
PROC PRINT data = citypops;
title 'The citypops data set';
format pop2000 comma10.;
RUN;
Obs | city | pop2000 |
---|---|---|
1 | New York | 8,008,278 |
2 | Los Angeles | 3,694,820 |
3 | Chicago | 2,896,016 |
4 | Houston | 1,953,631 |
5 | Philadelphia | 1,517,550 |
6 | Phoenix | 1,321,045 |
7 | San Antonio | 1,144,646 |
8 | San Diego | 1,223,400 |
9 | Dallas | 1,188,580 |
10 | San Jose | 894,943 |
Comparing this program to the previous program you should note four differences:
- The LENGTH statement tells SAS, in the compile phase, to define the city variable as a character variable, and to expect the city names to be as long as 12 characters.
- The ampersand (&) that follows the city variable in the INPUT statement tells SAS that the city values may contain one or more single embedded blanks. Because the ampersand modifier is used, SAS will read the city value until two or more consecutive blanks are encountered. That is a very important point ... when you use ampersand modified list input, the values that you are reading in must be separated by two or more consecutive blanks. You cannot use any other delimiter to indicate the end of each field.
- The commas have been removed from the population values so that SAS can read in the population values using unmodified (standard) list input for the pop2000 variable.
- A FORMAT statement has been added to the PRINT procedure just so that the pop2000 values are displayed with commas.
Launch and run the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are read in properly.
Example 20.15
Rather than using a LENGTH statement to define the type and length of the city variable, we can place a $w. character informat right in the INPUT statement. The only difference between the following program and the previous one is that the LENGTH statement has been removed, and the $12. character informat has been inserted into the INPUT statement immediately following the city variable's ampersand (&) modifier:
DATA citypops;
infile DATALINES FIRSTOBS = 2;
input city & $12. pop2000;
DATALINES;
City Yr2000Popn
New York 8008278
Los Angeles 3694820
Chicago 2896016
Houston 1953631
Philadelphia 1517550
Phoenix 1321045
San Antonio 1144646
San Diego 1223400
Dallas 1188580
San Jose 894943
;
RUN;
PROC PRINT data = citypops;
title 'The citypops data set';
format pop2000 comma10.;
RUN;
Obs | city | pop2000 |
---|---|---|
1 | New York | 8,008,278 |
2 | Los Angeles | 3,694,820 |
3 | Chicago | 2,896,016 |
4 | Houston | 1,953,631 |
5 | Philadelphia | 1,517,550 |
6 | Phoenix | 1,321,045 |
7 | San Antonio | 1,144,646 |
8 | San Diego | 1,223,400 |
9 | Dallas | 1,188,580 |
10 | San Jose | 894,943 |
Launch and run the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are again read in properly.
The Colon (:) Modifier
The colon (:) modifier allows us to use list input to read nonstandard data values and character values that are longer than eight characters, but which contain no embedded blanks. The colon (:) indicates that values are read until a blank (or other delimiters) is encountered, and then an informat is applied. If an informat for reading character values is specified, the w value specifies the variable's length, overriding the default length of 8.
Example 20.1
The following program uses the colon (:) modifier to tell SAS to expect commas when reading in the values for the pop2000 variable:
DATA citypops;
infile DATALINES FIRSTOBS = 2;
input city & $12. pop2000 : comma.;
DATALINES;
City Yr2000Popn
New York 8,008,278
Los Angeles 3,694,820
Chicago 2,896,016
Houston 1,953,631
Philadelphia 1,517,550
Phoenix 1,321,045
San Antonio 1,144,646
San Diego 1,223,400
Dallas 1,188,580
San Jose 894,943
;
RUN;
PROC PRINT data = citypops;
title 'The citypops data set';
format pop2000 comma10.;
RUN;
Obs | city | pop2000 |
---|---|---|
1 | New York | 8,008,278 |
2 | Los Angeles | 3,694,820 |
3 | Chicago | 2,896,016 |
4 | Houston | 1,953,631 |
5 | Philadelphia | 1,517,550 |
6 | Phoenix | 1,321,045 |
7 | San Antonio | 1,144,646 |
8 | San Diego | 1,223,400 |
9 | Dallas | 1,188,580 |
10 | San Jose | 894,943 |
Comparing this program to the previous program you should note just two differences:
- The commas have been added back into the population values so that we can see how to use the colon (:) modifier to read in nonstandard data values while still using list input.
- The colon (:) and comma. informat that follows the pop2000 variable in the INPUT statement tells SAS to expect population values to contain nonstandard characters — commas, in this particular instance. As illustrated here, we need not specify a w value when using the COMMAw.d informat here. That's because list input just reads each value until a blank is detected. (This differs from using a numeric informat with formatted input, in which we must specify a w value in order to tell SAS how many columns to read.)
Launch and run the SAS program, and review the output to convince yourself that the values for both the city and pop2000 variables are again read in properly.
Comparing Formatted Input and Modified List Input
It is important to keep in mind that informats work differently in modified list input than they do in formatted input. So, let's emphasize the point! With formatted input, the informat determines both the length of character variables and the number of columns that are read. The same number of columns are read from each record. For example, the following INPUT statement using formatted input:
input @1 City $12. @15 Pop2000 comma10.;
uses the $12. character informat to tell SAS to set the length of the city variable to 12 as well as to read columns 1 to 12 when reading in these data values:
City | Pop2000 |
---|---|
New York | 8,008,278 |
Los Angeles | 3,694,820 |
Chicago | 2,896,016 |
Huston | 1,953,631 |
Philadelphia | 1,517,550 |
The informat in the modified list input, on the other hand, determines only the length of the modified variable, not the number of columns that are read. Here:
input city & $12. pop2000 : comma.;
the raw data values are read until two consecutive blanks are encountered when reading in these data values suitable for list input:
City | Pop2000 |
---|---|
New York | 8,008,278 |
Los Angeles | 3,694,820 |
Chicago | 2,896,016 |
Huston | 1,953,631 |
Philadelphia | 1,517,550 |
20.5 - Mixing Input Styles
20.5 - Mixing Input StylesOf course, the first thing we always do when trying to read data from a raw data file into a SAS data set is to review the data to determine whether we need to use column, formatted, or list input to read in the data values. There are some situations, however, in which just one input method doesn't do the trick. In those cases, we'll want to consider mixing input styles, that is, using more than one input method simultaneously. Let's take a look at an example!
Example 20.17
The following program illustrates using column input, list input, and formatted input simultaneously to read in data concerning five U.S. national parks:
DATA nationalparks;
input ParkName $ 1-22 State $ Year @40 Acreage comma9.;
DATALINES;
Yellowstone ID/MT/WY 1872 4,065,493
Everglades FL 1934 1,398,800
Yosemite CA 1864 760,917
Great Smoky Mountains NC/TN 1926 520,269
Wolf Trap Farm VA 1966 130;
RUN;
PROC PRINT data = nationalparks;
format acreage comma9.;
RUN;
Obs | ParkName | State | Year | Acreage |
---|---|---|---|---|
1 | Yellowstone | ID/MT/WY | 1872 | 4,065,493 |
2 | Everglades | FL | 1934 | 1,398,800 |
3 | Yosemite | CA | 1864 | 760,917 |
4 | Great Smoky Mountains | NC/TN | 1926 | 520,269 |
5 | Wolf Trap Farm | VA | 1966 | 130 |
Reviewing the data, you can see that:
- Column input is an appropriate method for the first field (ParkName) because the values can be read as standard character values and are arranged in a neatly defined column.
- The next two fields are candidates for list input, as they are separated by a single blank while the first (State) contains standard character values and the second (Year) contains standard numeric values.
- The values in the last field (Acreage) are arranged in a neatly defined column, but the values require an informat. Therefore, formatted input is an appropriate method.
As you can see by the INPUT statement, mixing the input styles is easily achieved by simply using the different methods within the same INPUT statement. Launch and run the SAS program, and review the output to convince yourself that SAS read in the data properly.
20.6 - Summary
20.6 - SummaryIn this lesson, we extended our knowledge of how to read data from various types of data files into SAS data sets.
The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work and can use them in your own SAS programming.