Original Message re C Strings

This was written in 1989 and sent to various newsgroups and people.

----------

I  have  been  carefully watching  the C  Echo for  about 6  months, and

noticed that many of the messages  and mistakes  seem to  revolve around

STRING Handling.

I would like to see a very simple change made to C to solve many  of the

problems,  and  at  the  same  time,  increase  performance of  C string

handling between 6 and over 20 fold.  (Yes, 20 is not a typo!)

The  changes to  C are  quite minor,  and will  not affect  any programs

already written.  In addition, it will save the extra coding necessary to

compare strings.  For example, it would be wonderful to be able to code:

      if (variable=='some literal')  ...

instead of:

      if (strcmp(variable,"some literal")==0)  ...
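
Until such syntax exists, the comparison can at least be wrapped so that it reads more naturally.  The following is only a minimal sketch in today's C; the names STR_EQ and is_quit_command are made up for illustration and are not part of any library or of the macros mentioned later.

```c
#include <string.h>

/* Hypothetical helper: true when two NUL-terminated strings have the
   same contents.  Plain == on char pointers would only compare the
   addresses, not the characters. */
#define STR_EQ(a, b) (strcmp((a), (b)) == 0)

/* Example use: returns 1 when the caller's string is exactly "quit". */
int is_quit_command(const char *variable)
{
    return STR_EQ(variable, "quit");
}
```

With this in place a test reads as if (STR_EQ(variable, "some literal")) ... which is closer to the proposed form, although it still pays for a function call and a scan for the terminator.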

The changes are:

1. Extend  the  syntax  for  single  quotes  to  allow  longer  than one

   character.  Thus statements such as:

            if (name=='literal')

      and

            name='literal string'

   become valid.

2. Extend the language syntax to allow additional types of string.  These

   would be:

      *  Fixed length strings similar to PL/I or PASCAL.

         PL/I  fixed  length  strings  differ  from PASCAL  Fixed length

         strings in one way  only -  when PL/I  compares a  fixed length

         string  with another  string, it  effectively pads  the shorter

         string with blanks so that a true comparison  takes place  in a

         IF statement.

         PASCAL and C, on the other hand, use the length of the strings

         in a compare.

      *  Varying length strings.  Again, I would suggest two  main types

         - the C and PASCAL type, and the PL/I type.

         The  same  comments  about  comparisons  apply to  PL/I varying

         strings - the strings are EFFECTIVELY padded with blanks when a

         comparison is done.

   By simply adding the word 'var' or 'fixed' to a character definition,

   optionally followed by the word 'pli', the compiler would be  able to

   know the difference between the string types.

   Thus you could have the following definition:

          char string-name char[50] var pli;

   or

          char name char[100] var;   /* C  or PASCAL  style strings */
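
The difference between the two comparison rules can be sketched in ordinary C.  This is an illustration only, assuming strings are passed with an explicit length; the function name cmp_blank_padded is hypothetical.  It follows the PL/I rule described above, where the shorter string is treated as if it were padded on the right with blanks.

```c
#include <stddef.h>
#include <string.h>

/* Compare two byte strings of explicit length as if the shorter one
   were padded on the right with blanks (the PL/I rule).
   Returns <0, 0 or >0 in the manner of memcmp. */
int cmp_blank_padded(const char *a, size_t alen,
                     const char *b, size_t blen)
{
    size_t common = alen < blen ? alen : blen;
    size_t i;
    int r = memcmp(a, b, common);

    if (r != 0)
        return r;

    /* At most one of these loops runs: the tail of the longer string
       is compared against blanks. */
    for (i = common; i < alen; i++)
        if (a[i] != ' ')
            return (unsigned char)a[i] < ' ' ? -1 : 1;
    for (i = common; i < blen; i++)
        if (b[i] != ' ')
            return ' ' < (unsigned char)b[i] ? -1 : 1;
    return 0;
}
```

Under this rule "AB" and "AB   " compare equal, whereas a length-sensitive compare of the C or PASCAL kind would report them as different.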

Next message contains compiler changes required.

Continued Message about a minor change to C

              WHY IS C STRING HANDLING SLOWER THAN IT CAN BE?

              -----------------------------------------------

                         The "C" String Problem

                         ======================

   Unlike PL/I, PASCAL, BASIC etc., "C" does not hold the length of a

varying string at the front of the string; rather, a string is terminated

by a binary 0 at the end of the string.  Performance suffers greatly

because a search must be made for the binary 0 before a string can be

copied to, or compared with, another string.

   Because C uses a Binary 0 at the end of each string instead of keeping

the current length of VARYING strings at the front  of the  string, when

you  copy  a  string  to  another  string  (or  compare one  string with

another), C must either find out how much data to copy or compare, or it

must look for the binary 0 AS IT IS COPYING.
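
The two layouts can be put side by side in a short sketch.  The struct vstr type, its VSTR_MAX capacity and both function names are assumptions made for this illustration; they are not the proposed compiler syntax.

```c
#include <stddef.h>
#include <string.h>

#define VSTR_MAX 80   /* assumed capacity for this sketch */

/* Hypothetical varying-length string with the current length kept at
   the front, in the manner of PL/I or PASCAL. */
struct vstr {
    size_t len;
    char   data[VSTR_MAX];
};

/* NUL-terminated copy: every byte moved is also tested for the
   binary 0 while copying. */
void copy_terminated(char *dest, const char *src)
{
    while ((*dest++ = *src++) != '\0')
        ;
}

/* Length-prefixed copy: the amount of data is known in advance, so a
   single block move is enough.  Assumes src->len <= VSTR_MAX. */
void copy_prefixed(struct vstr *dest, const struct vstr *src)
{
    dest->len = src->len;
    memcpy(dest->data, src->data, src->len);
}
```

The second form is the one that lets a compiler or library use a block move instruction directly, because no per-byte test for the terminator is required.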

   In addition, "C" string handling is usually performed  with expensive

string functions - and functions with parameters  are expensive  to call

and error prone (for example, there  is nothing  in "C"  to stop  a long

string being copied to a short string, and destroying data  that follows

the short string).

   The real problem is that "C" does not have any inbuilt string

handling facilities, and this is probably because the original machine "C"

was designed for (the PDP-11) didn't have them either.

Many machines such as the IBM PC with 8086 chips or equivalent,  and IBM

mainframes like the 370 series of computers have instructions (or groups

of instructions) for moving and comparing data quickly.  For example:

        8086       REP MOVSW and REP MOVSB

        370        MVC, MVCL, CLC etc.

Even Z80's had instructions for  moving data  quickly, and  probably the

68000 also has similar instructions.

Therefore, it is most probable that MOST of the  computers in  the world

(using C) are UNDERUTILIZED because the C language does not have support

for strings BUILT IN.

                     OTHER PROBLEMS WITH C STRINGS.

                     ------------------------------

In addition to the EFFICIENCY problem mentioned above, "C" has no

checking when strings are copied.  It is all too easy to copy too much

data from one place to another, and overwrite an area of storage

accidentally.

PL/I, PASCAL and the IBM 370 Assembler automatically check if too much

data is being moved, and truncate a string copy if necessary.
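
A truncating copy of this kind can be approximated in standard C when the size of the destination is passed explicitly.  The following is a sketch under that assumption; copy_truncating is a hypothetical name, not a standard library function.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dest, which holds dest_size bytes, truncating when the
   source is too long.  dest is always NUL-terminated when
   dest_size > 0.  Returns the number of characters actually copied. */
size_t copy_truncating(char *dest, size_t dest_size, const char *src)
{
    size_t n;

    if (dest_size == 0)
        return 0;

    n = strlen(src);
    if (n > dest_size - 1)
        n = dest_size - 1;      /* leave room for the terminator */

    memcpy(dest, src, n);
    dest[n] = '\0';
    return n;
}
```

Copying "abcdefgh" into a 4 byte buffer with this routine stores "abc" and leaves the storage that follows the buffer untouched.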

Continued Message about a minor change to C

Included in this group of messages is a program that you can run to

demonstrate the relative times it takes to copy one string to another.

The following is the result of that small program.

It shows the relative times  for various  methods of copying strings.

 __________________________________________________________________

|                                                                  |

|   Elapsed Time for DUMMY CPY (Loop Overhead)   2.00              |

|                                                                  |

|   Elapsed Time for ASM CPY is  7.00                              |

|                                                                  |

|   Elapsed Time for MEMCPY CPY is 12.00                           |

|                                                                  |

|   Elapsed Time for STRCPY CPY is 31.00                           |

|                                                                  |

|   Elapsed Time for INCRCPY (Slowest) CPY is 136.00               |

|                                                                  |

|   Elapsed Time for INCRCPY (using Register Vars) CPY is 67.00    |

|__________________________________________________________________|

|                                                                  |

|     Type of Copy               Elap    Loop     Real    Ratio    |

|                                Time    Overhd   Copy             |

|__________________________________________________________________|

|                                                                  |

| Copying a 66 byte string 300000 times                            |

| =====================================                            |

|                                                                  |

|   INCRCPY (Static   Pointers)  136.00   2.00  134.00   1.000000  |

|   INCRCPY (Register Pointers)   67.00   2.00   65.00   2.061538  |

|   STRCPY                        31.00   2.00   29.00   4.620690  |

|   MEMCPY                        12.00   2.00   10.00  13.400000  |

|   ASM                            7.00   2.00    5.00  26.800000  |

|                                                        ^         |

|                                                        |         |

|  Note -------------------------------------------------|         |

|__________________________________________________________________|
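
The Ratio column appears to be the real copy time of the slowest method (136.00 - 2.00 = 134.00) divided by the real copy time of each method.  A small sketch of that arithmetic follows; speed_ratio is a hypothetical name and the formula is inferred from the figures in the table, not taken from the timing program itself.

```c
/* Speed-up of one method relative to a baseline, after subtracting the
   loop overhead from both elapsed times. */
double speed_ratio(double baseline_elapsed, double elapsed,
                   double loop_overhead)
{
    return (baseline_elapsed - loop_overhead) / (elapsed - loop_overhead);
}
```

For example, speed_ratio(136.0, 7.0, 2.0) gives 134 / 5 = 26.8, which matches the ASM line, and speed_ratio(136.0, 31.0, 2.0) gives 134 / 29, roughly 4.62, which matches the STRCPY line.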

The next message mentions some C macros that can be  used until  C compilers

are made available with fast string handling built in.

Continued Message about a minor change to C

Over many, many  months, I  have developed  some C  macros that  will copy

varying length and fixed length strings, and compare  them with  each other.

There are two main versions of these routines:

*  A generic set of macros that should work on  any C  compiler.  These give

   speed improvements of about 10 fold over the usual:

      while (*dest++=*src++);

*  A set of routines that should be usable with any C compiler that has the

   capability to generate  ASM code  and call  in a  macro assembler.  These

   should give a speed improvement of about 20 fold on an 8086  machine, and

   probably similar on a 370 type mainframe.

   TWENTY FOLD !

   ===========

   The C Macros and routines mentioned above:

*  Speed copying strings by a factor of 4 to over 20 times.

*  Speed comparing strings by a factor of 2 to over 18 times.

*  Add a generic CPY function for copying one  string to  another (fixed

   and varying length strings).

*  Add a generic  CMP function  for comparing  fixed and  varying length

   strings  with  each other,  and optionally  checking that  the longer

   strings  have  blanks  on  the  end  thus  providing  a  true  string

   comparison.

*  Provide an easier method of defining and using EXTERNAL variables.

*  Automatically truncate  long strings  when they  are copied  to short

   strings  so  that  storage  following  the  shorter  strings  is  not

   accidentally overwritten.

*  Automatically pad longer strings with blanks when a short string is

   copied to a long fixed length string.

*  Generic functions written totally in "C" macros can be used  with any

   "C" compiler to give approximately two times speed improvement.

   In addition, special macros  can be  used with  TURBOC and  QUICKC to

   provide a minimum of a three times improvement during the development

   phase.

   Finally, finely tuned Assembler macros can be used to  produce really

   fast  small  production programs  outside the  integrated environment

   (improvements over twenty times can be gained).

*  With a few minor modifications, the macros provided can be  used with

   IBM 370 style machines to enable the effective use of the CLC and MVC

   and the "long" equivalent instructions.
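
The truncate-or-pad behaviour described in the list above can be sketched for fixed length strings as follows.  This is not the macro set itself, only an illustration in portable C; copy_fixed is a hypothetical name and both strings are assumed to be passed with explicit lengths and no terminating 0.

```c
#include <stddef.h>
#include <string.h>

/* Copy src (src_len bytes) into a fixed length field of dest_len bytes.
   A long source is truncated; a short source is padded on the right
   with blanks so the whole field is always filled. */
void copy_fixed(char *dest, size_t dest_len,
                const char *src, size_t src_len)
{
    size_t n = src_len < dest_len ? src_len : dest_len;

    memcpy(dest, src, n);                 /* the part that fits       */
    memset(dest + n, ' ', dest_len - n);  /* blank fill the remainder */
}
```

Because both lengths are known before the copy starts, the work reduces to one block move and one block fill, which is where instructions such as REP MOVSB or MVC can be used.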

   

------------------------------

 

Yesterday, I sent a group of messages about the speed of string handling

in C, and a program to demonstrate how much faster string handling could

be.

The changes required to be made to C are simple, and will result

in much safer and faster programs.

Again, I'd like to hear your thoughts.

Regards

Clement Victor Clarke