20.3 - Issues with Reading Free-Format Data

Just as you might encounter situations in which reading data using column and formatted input is not a straightforward exercise, so is the case with list input. In this section, we'll investigate two such situations. First, we'll address what you need to do when your input data file contains some missing values. Then, we'll investigate using the LENGTH statement as a way to tell SAS to set the length of character values to be something other than the default 8 characters. Let's jump right in again!

Reading Missing Values Using List Input Section

As mentioned in our review of list input, whenever our data contain missing values, we have to make sure that there is a placeholder in place of the missing values ... a period or some other character. If not, when using list input, the blank field causes the matching of variable names and values to get out of sync. You can imagine a situation in which you have a large data file ... let's say 5,000 records and 100 variables ... with no placeholders in place for the data values that are missing. You certainly wouldn't want to have to enter missing value placeholders into the data file just so you could read the data file into a SAS data set. Fortunately, there are ways around this problem! You've already seen one way in the last section, namely the use of the INFILE statement's MISSOVER option. That option will serve you well when using list input too, providing all of your missing values are at the end of a record. The other way, namely the use of the INFILE statement's DSD option, will serve you well when your missing values are at the beginning or middle of a record.

Example 20.9

The following program uses list input and the INFILE statement's MISSOVER option to read in the number of books five children read each week in a library's summer reading program when some of the values are missing at the end of a record:

DATA reading;
	infile DATALINES MISSOVER;
	input Name $ Week1-Week5;
	DATALINES;
Robin 3 2 5 1 6
Jack 4 4 4 3 4
Tim 3 0 0
Martin 1 0 1 1
Caroline 2 3 4 5 6
RUN;
 
PROC PRINT data = reading;
	title 'Summer Reading Program';
	var Name Week1-Week5;
RUN;

Summer Reading Program
ObsNameWeek1Week2Week3Week4Week5
1Robin32516
2Jack44434
3Tim300..
4Martin1011.
5Caroline23456

First, review the data to note that Tim has no values recorded for the fourth and fifth week and Martin has no values recorded for the fifth week. Also, note that just so you don't have to go through the rigamarole of downloading a data file and editing the INFILE statement, we are back to using the DATALINES statement to read the data instream. But, in order to still be able to take advantage of the INFILE statement's options, we have to tell SAS to expect the data after the DATALINES statement. That's why the:

infile DATALINES ...;

statement.

We need to use the INFILE statement just so we can take advantage of the statement's MISSOVER option. Just as is the case for column and formatted input, the MISSOVER option tells SAS not to advance to the next record when it doesn't find values in the current line for all of the INPUT statement variables. At the end of the current record, values that are expected but not found are set to missing.

One more thing to note is the range of numeric variables:

Week1-Week5

that appears in the INPUT statement. When the variable values in the input raw data file are sequential and separated by a blank (or some other such delimiter), you can specify a range of variables in the INPUT statement, as we did in the above program. This is especially useful if your data contains similar variables. For example, if your data contain the answers to a questionnaire, you might want to call your variables q1, q2, q3, ... . Note that you can also specify a range of variables in the PRINT procedure's VAR statement.

Finally, launch and run  the SAS program, and review the output to convince yourself that the data are read in properly.

It is important to keep in mind that the MISSOVER option works only for missing values that occur at the end of the record. If your data contains missing values at the beginning or middle of a record, then you need instead to take advantage of the INFILE statement's DSD option.

Example 20.10

 The following program uses list input and the INFILE statement's DSD option to read in the results of a survey of five people when some of the values are missing in the middle of a record:

DATA survey;
	infile DATALINES DLM=',' DSD;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert,,A,C,A,D
William,B,C,A,D,A
Linda,C,B,,A,C
Lisa,D,D,D,C,A
Katherine,A,B,C,D,A
RUN;
 
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
ObsNameQ1Q2Q3Q4Q5
1Robert ACAD
2WilliamBCADA
3LindaCB AC
4LisaDDDCA
5KatherinABCDA

First, review the data to note that Robert and Linda each have one missing value in the middle of their record. Also, note that the data values are delimited with commas now rather than blanks. That's why the:

DLM=','

option appears in the INFILE statement. You might recall from Stat 480 that DLM stands for delimiter. Then, you should notice the range of variables:

(Q1-Q5) ($)

that appears in the INPUT statement. We've used the proper syntax for specifying a range of character variables, namely both the variable list and the $ sign are enclosed in parentheses.

Hmmm ... what else? Oh yeah ... the whole point of this example ... the DSD option that appears in the INFILE statement. The DSD option changes how SAS treats delimiters when list input is used. Specifically, the DSD option:

  • sets the default delimiter to a comma
  • treats two consecutive delimiters as a missing value
  • removes quotation marks from values

That first point clearly makes our inclusion of the DLM=',' option superfluous, but that's okay. It's the second point that is the most relevant and the one that makes our program here work. Let's check it out! Launch and run  the SAS program, and review the output to convince yourself that the data are read in properly. Oops ... not quite! The name Katherine contains nine characters. Because SAS defines the length of character variables to be 8 bytes by default, SAS truncates her name and stores it as Katherin without the ending e. We'll learn how to rectify that problem soon, but first one more example of the DSD option.

Example 20.11

The following program illustrates that the DSD option can also be used i) when there is a missing value at the beginning of a record, and ii) when the data are delimited by blanks (in conjunction with the DLM= option):

DATA survey;
	infile DATALINES DLM=' ' DSD;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert  A C A D
William B C A D A
Linda C B  A C
 D D D C A
Katherine A B C D A
RUN;
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
ObsNameQ1Q2Q3Q4Q5
1Robert ACAD
2WilliamBCADA
3LindaCB AC
4 DDDCA
5KatherinABCDA

The only difference between this program and the previous one is that the comma delimiters have been replaced with blanks, and the first value (Lisa) in the fourth record has been deleted so that our data set contains a missing value at the beginning of a record. Because the data are now delimited with blanks we use the DLM = ' ' option. Launch and run  the SAS program, and review the output to convince yourself that the data are read in properly. Oops ... let's go fix that problem with Katherine now.

Modifying the Length of Character Values Section

As mentioned previously, when you use list input to read raw data, character values are assigned a default length of 8. When we ran the last two programs, that's why Katherine was truncated to Katherin. The LENGTH statement will help us resolve this problem.

Example 20.12

The following program is identical to the previous program, except a LENGTH statement has been added to tell SAS to define the length of the character variable Name to be 9 rather than the default 8:

DATA survey;
	infile DATALINES DLM=' ' DSD;
	length Name $ 9;
	input Name $ (Q1-Q5) ($);
	DATALINES;
Robert  A C A D
William B C A D A
Linda C B  A C
 D D D C A
Katherine A B C D A
RUN;
 
PROC PRINT data = survey;
	title 'Survey Results';
	var Name Q1-Q5;
RUN;

Survey Results
ObsNameQ1Q2Q3Q4Q5
1Robert ACAD
2WilliamBCADA
3LindaCB AC
4 DDDCA
5KatherineABCDA

Note that the LENGTH statement was intentionally placed before the INPUT statement. That's because variable attributes are defined when the variable is first encountered in the DATA step. If we placed the LENGTH statement after the INPUT statement, SAS would therefore by default define the length of Name to be 8 characters, not the desired 9. As you can see by the inclusion of the dollar sign ($), the LENGTH statement also tells SAS that Name should be treated as a character variable.

Now, go ahead and launch and run  the SAS program, and review the output to convince yourself that all of the data are read in properly this time!