Original Message re C Strings

This was written in 1989 and sent to various newsgroups and people..
----------
I  have  been  carefully watching  the C  Echo for  about 6  months, and
noticed that many of the messages  and mistakes  seem to  revolve around
STRING Handling.

I would like to see a very simple change made to C to solve many  of the
problems,  and  at  the  same  time,  increase  performance of  C string
handling between 6 and over 20 fold.  (Yes 20 is not a typo error!).

The  changes to  C are  quite minor,  and will  not affect  any programs
already written.  In addition, it will save the extra coding necessay to
compare strings.  For example, it would be wonderful to be able to code:

      if (variable=='some literal')  ...

instead of:

      if (strcmp(variable,"some literal")==0)  ...

The changes are:

1. Extend  the  syntax  for  single  quotes  to  allow  longer  than one
   character.  Thus statements such as:

            if (name=='literal')
      and
            name='literal string'

   become valid.

2. Extend the language syntax to allow a  three types  of string.  These
   would be:

      *  Fixed length strings similar to PL/I or PASCAL.

         PL/I  fixed  length  strings  differ  from PASCAL  Fixed length
         strings in one way  only -  when PL/I  compares a  fixed length
         string  with another  string, it  effectively pads  the shorter
         string with blanks so that a true comparison  takes place  in a
         IF statement.

         PASCAL and C on the other hand, use the  length of  the strings
         in a compare.

      *  Varying length strings.  Again, I would suggest two  main types
         - the C and PASCAL type, and the PL/I type.

         The  same  comments  about  comparisons  apply to  PL/I varying
         strings - the strings are EFFECTIVELY padded with blanks when a
         comparison is done.

   By simply adding the word 'var' or 'fixed' to a character definition,
   optionally followed by the word 'pli', the compiler would be  able to
   know the difference between the string types.

   Thus you could have the following definition:

          char string-name char[50] var pli;
   or
          char name char[100] var;   /* C  or PASCAL  style strings */

Next message contains compiler changes required.
Continued Message about a minor change to C


              WHY IS C STRING HANDLING SLOWER THAN IT CAN BE?
              -----------------------------------------------

                         The "C" String Problem
                         ======================

   Unlike PL/I, PASCAL, BASIC etc,  "C" does  not hold  the length  of a
varying string at the front of the string  - rather  a string  is termi-
nated  by  a binary  0 at  the end  of the  string.  Performance suffers
greatly because a search must be made for the binary  0 before  a string
can be copied or compared to or with another string.
   Because C uses a Binary 0 at the end of each string instead of keeping
the current length of VARYING strings at the front  of the  string, when
you  copy  a  string  to  another  string  (or  compare one  string with
another), C must either find out how much data to copy or compare, or it
must look for the binary 0 AS IT IS COPYING.
   In addition, "C" string handling is usually performed  with expensive
string functions - and functions with parameters  are expensive  to call
and error prone (for example, there  is nothing  in "C"  to stop  a long
string being copied to a short string, and destroying data  that follows
the short string).
   The  real  problem  is  that  "C"  does not  have any  inbuilt string
handling facilities, and this probably because the original  machine "C"
was designed for (the PDP-11) didn't have them either.

Many machines such as the IBM PC with 8086 chips or equivalent,  and IBM
mainframes like the 370 series of computers have instructions (or groups
of instructions) for moving and comparing data quickly.  For example:

        8086       REP MOVSW and REP MOVSB
        370        MVC, MVCL, CLC etc.

Even Z80's had instructions for  moving data  quickly, and  probably the
68000 also has similar instructions.

Therefore, it is most probable that MOST of the  computers in  the world
(using C) are UNDERUTILIZED because the C language does not have support
for strings BUILTIN.



                     OTHER PROBLEMS WITH C STRINGS.
                     ------------------------------


In  addition  to  the  EFFICENCY  problem  mentioned  above, "C"  has no
checking when strings are copied.  It is all too easy  to copy  too much
data  from  one  place  to  another,  and  overwrite  area   of  storage
accidently.

PL/I, PASCAL and the IBM 370 Assembler automatically  check if  too much
data is being moved, and truncates a string copy if necessary.

Continued Message about a minor change to C

Included in this group of messages is a program that you can run to
demonstrate the relative times it takes to copy one string to another.


The following is the result of that small program.

It shows the relative times  for various  methods of copying strings.


 __________________________________________________________________
|                                                                  |
|   Elapsed Time for DUMMY CPY (Loop Overhead)   2.00              |
|                                                                  |
|   Elapsed Time for ASM CPY is  7.00                              |
|                                                                  |
|   Elapsed Time for MEMCPY CPY is 12.00                           |
|                                                                  |
|   Elapsed Time for STRCPY CPY is 31.00                           |
|                                                                  |
|   Elapsed Time for INCRCPY (Slowest) CPY is 136.00               |
|                                                                  |
|   Elapsed Time for INCRCPY (using Register Vars) CPY is 67.00    |
|__________________________________________________________________|
|                                                                  |
|     Type of Copy               Elap    Loop     Real    Ratio    |
|                                Time    Overhd   Copy             |
|__________________________________________________________________|
|                                                                  |
| Copying a 66 byte string 300000 times                            |
| =====================================                            |
|                                                                  |
|   INCRCPY (Static   Pointers)  136.00   2.00  134.00   1.000000  |
|   INCRCPY (Register Pointers)   67.00   2.00   65.00   2.061538  |
|   STRCPY                        31.00   2.00   29.00   4.620690  |
|   MEMCPY                        12.00   2.00   10.00  13.400000  |
|   ASM                            7.00   2.00    5.00  26.800000  |
|                                                        ^         |
|                                                        |         |
|  Note -------------------------------------------------|         |
|__________________________________________________________________|



The next message mentions some C macros that can be  used until  C compilers
are made available with fast string handling builtin.

Continued Message about a minor change to C

Over many, many  months, I  have developed  some C  macros that  will copy
varying length and fixed length strings, and compare  them with  each other.
There are two main versions of these routines:

*  A generic set of macros that should work on  any C  compiler.  These give
   speed improvements of about 10 fold over the usual:

      while (*dest++=*src++);

*  A set of routines that should be usuable with any C compiler that has the
   capability to generate  ASM code  and call  in a  macro assembler.  These
   should give a speed improvement of about 20 fold on an 8086  machine, and
   probably similiar on a 370 type mainframe.

   TWENTY FOLD !
   ===========


   The C Macros and routines mentioned above:

*  Speed copying strings by a factor or 4 to over 20 times.

*  Speed comparing strings by a factor of 2 to over 18 times.

*  Add a generic CPY function for copying one  string to  another (fixed
   and varying length strings).

*  Add a generic  CMP function  for comparing  fixed and  varying length
   strings  with  each other,  and optionally  checking that  the longer
   strings  have  blanks  on  the  end  thus  providing  a  true  string
   comparision.

*  Provide an easier method of defining and using EXTERNAL variables.

*  Automatically truncate  long strings  when they  are copied  to short
   strings  so  that  storage  following  the  shorter  strings  is  not
   accidently overwritten.

*  Automatically pads longer strings with blanks when a short  string is
   copied to a long fixed length string.
                    _____ ______

*  Generic functions written totally in "C" macros can be used  with any
   "C" compiler to give approximately two times speed improvement.

   In addition, special macros  can be  used with  TURBOC and  QUICKC to
   provide a minimum of a three times improvement during the development
   phase.

   Finally, finely tuned Assembler macros can be used to  produce really
   fast  small  production programs  outside the  integrated environment
   (improvements over twenty times can be gained).

*  With a few minor modifications, the macros provided can be  used with
   IBM 370 style machines to enable the effective use of the CLC and MVC
   and the "long" equivalent instructions.
   
------------------------------
 
Yesterday, I sent a group of messages about the speed of string handling
in C, and a program to demonstrate how much faster string handling could
be.

As part of the group of messages, I tried to send sample output from the
program, but due to the line lengths being too long, it didn't all get
there!

Here is a sample of the output from the program.


 __________________________________________________________________
|                                                                  |
|   Elapsed Time for DUMMY CPY (Loop Overhead)   2.00              |
|                                                                  |
|   Elapsed Time for ASM CPY is  7.00                              |
|                                                                  |
|   Elapsed Time for MEMCPY CPY is 12.00                           |
|                                                                  |
|   Elapsed Time for STRCPY CPY is 31.00                           |
|                                                                  |
|   Elapsed Time for INCRCPY (Slowest) CPY is 136.00               |
|                                                                  |
|   Elapsed Time for INCRCPY (using Register Vars) CPY is 67.00    |
|__________________________________________________________________|
|                                                                  |
|     Type of Copy               Elap    Loop     Real    Ratio    |
|                                Time    Overhd   Copy             |
|__________________________________________________________________|
|                                                                  |
| Copying a 66 byte string 300000 times                            |
| =====================================                            |
|                                                                  |
|   INCRCPY (Static   Pointers)  136.00   2.00  134.00   1.000000  |
|   INCRCPY (Register Pointers)   67.00   2.00   65.00   2.061538  |
|   STRCPY                        31.00   2.00   29.00   4.620690  |
|   MEMCPY                        12.00   2.00   10.00  13.400000  |
|   ASM                            7.00   2.00    5.00  26.800000  |
|                                                        ^         |
|                                                        |         |
|  Note -------------------------------------------------|         |
|__________________________________________________________________|


The changes required to be made to C are simple, and will result
in much safer and faster programs.

Again, I'd like to hear your thoughts.

Regards

Clement Victor Clarke


Comments