4.5 - Converting Data

Suppose you are asked to calculate sales income using the price and the number of units sold. Pretty straightforward, eh? As long as price and units are stored in your data set as numeric variables, then you could just use the assignment statement:

sales = price * units;

It may be the case, however, that price and units are instead stored as character variables. Then, you can imagine it being a little odd trying to multiply price by units. In that case, the character variables price and units first need to be converted to numeric variables price and units. How SAS helps us do that is the subject of this section. To be specific, we'll learn how the INPUT function converts character values to numeric values.

The reality though is that SAS is a pretty smart application. If you try to do something to a character variable that should only be done to a numeric variable, SAS automatically tries first to convert the character variable to a numeric variable for you. The problem with taking this lazy person's approach is that it doesn't always work the way you'd hoped. That's why, by the end of our discussion, you'll appreciate that the moral of the story is that it is always best for you to perform the conversions yourself using the INPUT function.

Example 4.11 Section

The following SAS program illustrates how SAS tries to perform an automatic character-to-numeric conversion of standtest and e1, e2, e3, and e4 so that arithmetic operations can be performed on them:

DATA grades;
	input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
	avg = round(mean(e1,e2,e3,e4),1); 
	std = standtest/4;
	DATALINES;
Alexander Smith   78 82 86 69   1,210
John Simon        88 72 86  .     990
Patricia Jones    98 92 92 99   1,010
Jack Benedict     54 63 71 49     875
Rene Porter      100 62 88 74   1,180
;
RUN;

PROC PRINT data = grades;
RUN;

Okay, first note that for some crazy reason all of the data in the data set have been read in as character data. That is, even the exam scores (e1, e2, e3, e4) and the standardized test scores (standtest) are stored as character variables. Then, when SAS goes to calculate the average exam score (avg), SAS first attempts to convert e1, e2, e3, and e4 to numeric variables. Likewise, when SAS goes to calculate a new standardized test score (std), SAS first attempts to convert standtest to a numeric variable. Let's see how it does. Launch and run the SAS program, and before looking at the output window, take a look at the log window. You should see something that looks like this:


118 DATA grades;
119     input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest  $;
120     avg = round(mean(e1,e2,e3,e4),1);
121     std = standtest/4;
122     DATALINES;

NOTE: Character values have been coverted to numeric values at the places given by:
      (Line):(Column).
      120:22    120:25    120:28    120:31    121:11
NOTE: Invalid numeric data, standtest='1,212' , at line 121 column 11
RULE:   ----+----1----+----2----+----3----+----4----+----5----+----6-----+----7----+----8---
123     Alexander Smith 78 82 86 69 1,210
name=Alexander Smith e1=78 e2=82 e3=86 e4=69 standtest=1,210 avg=79 std=. _ERROR_=1 _N_=1
NOTE: Invalid numeric data, standtest='1,012' , at line 121 column 11
125     Patricia Jones 98 92 92 99 1,010
name=Patricia jones e1=98 e2=92 e3=92 e4=99 standtest=1,010 avg=95 std=. _ERROR_=1 _N_=3
NOTE: Invalid numeric data, standtest='1,180' , at line 121 column 11
127     Rene Porter 100 62 88 74 1,180
name=Rene Porter e1=100 e2=62 e3=88 e4=74 standtest=1,180 avg=81 std=. _ERROR_=1 _N_=5
NOTE: Missing values were generated as a result of preforming an operation on missing values.
      Each place is givem by: (Number of times) at (Line):(Column).
      3 at 121:20
NOTE: The dataset WORK.GRADE has 5 observations and 8 variables.
NOTE: DATA statement used (Total Process time):
      real time            0.01 seconds
      cpu time             0.01 seconds

The first NOTE that you see is a standard message that SAS prints in the log to warn you that it performed an automatic character-to-numeric conversion on your behalf. Then, you see three NOTES about invalid numeric data concerning the standtest values 1,210, 1,010, and 1,180. In case you haven't figured it out yourself, it's the commas in those numbers that is throwing SAS for a loop. In general, the automatic conversion produces a numeric missing value from any character value that does not conform to standard numeric values (containing only digits 0, 1, ..., 9, a decimal point, and plus or minus signs). That's why that fifth NOTE is there about missing values being generated. The output itself:

Obs name e1 e2 e3 e4 standtest avg std
1 Alexander Smith 78 82 86 69 1,210 79 .
2 John Simon 88 72 86   990 82 247.50
3 Patricia Jones 98 92 92 99 1,010 95 .
4 Jack Benedict 54 63 71 49 875 59 218.75
5 Rene Porter 100 62 88 74 1,180 81 .

shows the end result of the attempted automatic conversion. The calculation of avg went off without a hitch because e1, e2, e3, and e4 contain standard numeric values, whereas the calculation of std did not because standtest contains nonstandard numeric values. Let's take this character-to-numeric conversion into our own hands.

Example 4.12 Section

The following SAS program illustrates the use of the INPUT function to convert the character variable standtest to a numeric variable explicitly so that an arithmetic operation can be performed on it:

DATA grades;
	input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
	std = input(standtest,comma5.)/4;
	DATALINES;
Alexander Smith   78 82 86 69   1,210
John Simon        88 72 86  .     990
Patricia Jones    98 92 92 99   1,010
Jack Benedict     54 63 71 49     875
Rene Porter      100 62 88 74   1,180
;
RUN;

PROC PRINT data = grades;
   var name standtest std;
RUN;

The only difference between the calculation of std here and of that in the previous example is that the standtest variable has been inserted here into the INPUT function. The general form of the INPUT function is:

INPUT(source, informat)

where:

  • source is the character variable, constant or expression to be converted to a numeric variable
  • informat is a numeric informat that allows you to read the values stored in source

In our case, standtest is the character variable we are trying to convert to a numeric variable. The values in standtest conform to the comma5. informat, and hence its specification in the INPUT function.

Let's see how we did. Launch and run the SAS program, and again before looking at the output window, take a look at the log window. You should see something that now looks like this:

Ahhaa! No warnings about SAS taking over our program and performing automatic conversions. That's because we are in control this time! Now, looking at the output:

Obs name standtest std
1 Alexander Smith 1,210 302.50
2 John Simon 990 247.50
3 Patricia Jones 1,010 252.50
4 Jack Benedict 875 218.75
5 Rene Porter 1,180 295.00

we see that we successfully calculated std this time around. That's much better!

A couple of closing comments. First, I might use our discussion here to add another item to your growing list of good programming practices. Whenever possible, make sure that you are the one who is in control of your program. That is, know what your program is doing at all times, and if it's not doing what you'd expect it to do for all situations, then rewrite it in such a way to make sure that it does.

Second, you might be wondering "geez, we just spent all this time talking about character-to-numeric conversions, but what happens if I have to do a numeric-to-character conversion instead?" Don't worry ... SAS doesn't let you down. If you try to do something to a character variable that should only be done to a numeric variable, SAS automatically tries first to convert the character variable to a numeric variable. If that doesn't work, then you'll want to use the PUT function to convert your numeric values to character values explicitly. We'll address the PUT function in Stat 481 when we learn about character functions in depth.