4.5 - Converting Data

Suppose you are asked to calculate sales income using the price and the number of units sold. Pretty straightforward, eh? As long as price and units are stored in your data set as numeric variables, then you could just use the assignment statement:

sales = price * units;

It may be the case, however, that price and units are instead stored as character variables. Then, you can imagine it being a little odd trying to multiply price by units. In that case, the character variables price and units first need to be converted to numeric variables price and units. How SAS helps us do that is the subject of this section. To be specific, we'll learn how the INPUT function converts character values to numeric values.

The reality though is that SAS is a pretty smart application. If you try to do something to a character variable that should only be done to a numeric variable, SAS automatically tries first to convert the character variable to a numeric variable for you. The problem with taking this lazy person's approach is that it doesn't always work the way you'd hoped. That's why, by the end of our discussion, you'll appreciate that the moral of the story is that it is always best for you to perform the conversions yourself using the INPUT function.

Example 4.11 Section

The following SAS program illustrates how SAS tries to perform an automatic character-to-numeric conversion of standtest and e1, e2, e3, and e4 so that arithmetic operations can be performed on them:

DATA grades;
	input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
	avg = round(mean(e1,e2,e3,e4),1); 
	std = standtest/4;
	DATALINES;
Alexander Smith   78 82 86 69   1,210
John Simon        88 72 86  .     990
Patricia Jones    98 92 92 99   1,010
Jack Benedict     54 63 71 49     875
Rene Porter      100 62 88 74   1,180
;
RUN;
PROC PRINT data = grades;
RUN;

Okay, first note that for some crazy reason, all of the data in the data set have been read in as character data. That is, even the exam scores (e1, e2, e3, e4) and the standardized test scores (standtest) are stored as character variables. Then, when SAS goes to calculate the average exam score (avg), SAS first attempts to convert e1, e2, e3, and e4 to numeric variables. Likewise, when SAS calculates a new standardized test score (std), SAS first attempts to convert standtest to a numeric variable. Let's see how it does. Launch and run  the SAS program, and before looking at the output window, take a look at the log window. You should see something that looks like this:

         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
             
             DATA grades;
             input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
             avg = round(mean(e1,e2,e3,e4),1);
             std = standtest/4;
             DATALINES;
     
    NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
           71:19   71:22   71:25   71:28   72:8    
    NOTE: Invalid numeric data, standtest='1,210' , at line 72 column 8.
    RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0                     
             Alexander Smith   78 82 86 69   1,210
    name=Alexander Smith e1=78 e2=82 e3=86 e4=69 standtest=1,210 avg=79 std=. _ERROR_=1 _N_=1
    NOTE: Invalid numeric data, standtest='1,010' , at line 72 column 8.
             Patricia Jones    98 92 92 99   1,010
    name=Patricia Jones e1=98 e2=92 e3=92 e4=99 standtest=1,010 avg=95 std=. _ERROR_=1 _N_=3
    NOTE: Invalid numeric data, standtest='1,180' , at line 72 column 8.
             Rene Porter      100 62 88 74   1,180
    name=Rene Porter e1=100 e2=62 e3=88 e4=74 standtest=1,180 avg=81 std=. _ERROR_=1 _N_=5
    NOTE: Missing values were generated as a result of performing an operation on missing values.
         Each place is given by: (Number of times) at (Line):(Column).
         3 at 72:17   
    NOTE: The data set WORK.GRADES has 5 observations and 8 variables.
    NOTE: DATA statement used (Total process time):
         real time           0.00 seconds
         user cpu time       0.00 seconds
         system cpu time     0.00 seconds
         memory              708.81k
         OS Memory           20388.00k
         Timestamp           05/15/2023 12:13:39 PM
         Step Count                        24  Switch Count  2
         Page Faults                       0
         Page Reclaims                     199
         Page Swaps                        0
         Voluntary Context Switches        10
         Involuntary Context Switches      0
         Block Input Operations            0
         Block Output Operations           264

The first NOTE that you see is a standard message that SAS prints in the log to warn you that it performed an automatic character-to-numeric conversion on your behalf. Then, you see three NOTES about invalid numeric data concerning the standtest values 1,210, 1,010, and 1,180. In case you haven't figured it out yourself, it's the commas in those numbers that is throwing SAS for a loop. In general, the automatic conversion produces a numeric missing value from any character value that does not conform to standard numeric values (containing only digits 0, 1, ..., 9, a decimal point, and plus or minus signs). That's why that fifth NOTE is there about missing values being generated. The output itself:

Obs

name

e1

e2

e3

e4

standtest

avg

std

1

Alexander Smith

78

82

86

69

1,210

79

.

2

John Simon

88

72

86

 

990

82

247.50

3

Patricia Jones

98

92

92

99

1,010

95

.

4

Jack Benedict

54

63

71

49

875

59

218.75

5

Rene Porter

100

62

88

74

1,180

81

.

shows the end result of the attempted automatic conversion. The calculation of avg went off without a hitch because e1, e2, e3, and e4 contain standard numeric values, whereas the calculation of std did not because standtest contains nonstandard numeric values. Let's take this character-to-numeric conversion into our own hands.

Example 4.12 Section

The following SAS program illustrates the use of the INPUT function to convert the character variable standtest to a numeric variable explicitly so that an arithmetic operation can be performed on it:

DATA grades;
	input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
	std = input(standtest,comma5.)/4;
	DATALINES;
Alexander Smith   78 82 86 69   1,210
John Simon        88 72 86  .     990
Patricia Jones    98 92 92 99   1,010
Jack Benedict     54 63 71 49     875
Rene Porter      100 62 88 74   1,180
;
RUN;

PROC PRINT data = grades;
   var name standtest std;
RUN;

The only difference between the calculation of std here and that in the previous example is that the standtest variable has been inserted here into the INPUT function. The general form of the INPUT function is:

INPUT(source, informat)

where:

  • source is the character variable, constant or expression to be converted to a numeric variable
  • informat is a numeric informat that allows you to read the values stored in source

In our case, standtest is the character variable we are trying to convert to a numeric variable. The values in standtest conform to the comma5. informat, and hence its specification in the INPUT function.

Let's see how we did. Launch and run  the SAS program, and again before looking at the output window, take a look at the log window. You should see something that now looks like this:

    DATA grades;
        input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
        std = input(standtest,comma5.)/4;
        DATALINES;

NOTE: The data set WORK.GRADES has 5 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds


   ;
   RUN;

   PROC PRINT data = grades;
NOTE: Writing HTML Body file: sashtml.htm
      var name standtest std;
   RUN;

NOTE: There were 5 observations read from the data set WORK.GRADES.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.51 seconds
      cpu time            0.34 seconds

Ahhaa! No warnings about SAS taking over our program and performing automatic conversions. That's because we are in control this time! Now, looking at the output:

Obs

name

standtest

std

1

Alexander Smith

1,210

302.50

2

John Simon

990

247.50

3

Patricia Jones

1,010

252.50

4

Jack Benedict

875

218.75

5

Rene Porter

1,180

295.00

we see that we successfully calculated std this time around. That's much better!

A couple of closing comments. First, I might use our discussion here to add another item to your growing list of good programming practices. Whenever possible, make sure that you are the one who is in control of your program. That is, know what your program is doing at all times, and if it's not doing what you'd expect it to do for all situations, then rewrite it in such a way to make sure that it does.

Second, you might be wondering "geez, we just spent all this time talking about character-to-numeric conversions, but what happens if I have to do a numeric-to-character conversion instead?" Don't worry ... SAS doesn't let you down. If you try to do something to a character variable that should only be done to a numeric variable, SAS automatically tries first to convert the character variable to a numeric variable. If that doesn't work, then you'll want to use the PUT function to convert your numeric values to character values explicitly. We'll address the PUT function in Stat 481 when we learn about character functions in depth.