Original Message re C Strings
This was written in 1989 and sent to various newsgroups and people..
----------
I have been carefully watching the C Echo for about 6 months, and
noticed that many of the messages and mistakes seem to revolve around
STRING Handling.
I would like to see a very simple change made to C to solve many of the
problems, and at the same time, increase performance of C string
handling between 6 and over 20 fold. (Yes 20 is not a typo error!).
The changes to C are quite minor, and will not affect any programs
already written. In addition, it will save the extra coding necessay to
compare strings. For example, it would be wonderful to be able to code:
if (variable=='some literal') ...
instead of:
if (strcmp(variable,"some literal")==0) ...
The changes are:
1. Extend the syntax for single quotes to allow longer than one
character. Thus statements such as:
if (name=='literal')
and
name='literal string'
become valid.
2. Extend the language syntax to allow a three types of string. These
would be:
* Fixed length strings similar to PL/I or PASCAL.
PL/I fixed length strings differ from PASCAL Fixed length
strings in one way only - when PL/I compares a fixed length
string with another string, it effectively pads the shorter
string with blanks so that a true comparison takes place in a
IF statement.
PASCAL and C on the other hand, use the length of the strings
in a compare.
* Varying length strings. Again, I would suggest two main types
- the C and PASCAL type, and the PL/I type.
The same comments about comparisons apply to PL/I varying
strings - the strings are EFFECTIVELY padded with blanks when a
comparison is done.
By simply adding the word 'var' or 'fixed' to a character definition,
optionally followed by the word 'pli', the compiler would be able to
know the difference between the string types.
Thus you could have the following definition:
char string-name char[50] var pli;
or
char name char[100] var; /* C or PASCAL style strings */
Next message contains compiler changes required.
Continued Message about a minor change to C
WHY IS C STRING HANDLING SLOWER THAN IT CAN BE?
-----------------------------------------------
The "C" String Problem
======================
Unlike PL/I, PASCAL, BASIC etc, "C" does not hold the length of a
varying string at the front of the string - rather a string is termi-
nated by a binary 0 at the end of the string. Performance suffers
greatly because a search must be made for the binary 0 before a string
can be copied or compared to or with another string.
Because C uses a Binary 0 at the end of each string instead of keeping
the current length of VARYING strings at the front of the string, when
you copy a string to another string (or compare one string with
another), C must either find out how much data to copy or compare, or it
must look for the binary 0 AS IT IS COPYING.
In addition, "C" string handling is usually performed with expensive
string functions - and functions with parameters are expensive to call
and error prone (for example, there is nothing in "C" to stop a long
string being copied to a short string, and destroying data that follows
the short string).
The real problem is that "C" does not have any inbuilt string
handling facilities, and this probably because the original machine "C"
was designed for (the PDP-11) didn't have them either.
Many machines such as the IBM PC with 8086 chips or equivalent, and IBM
mainframes like the 370 series of computers have instructions (or groups
of instructions) for moving and comparing data quickly. For example:
8086 REP MOVSW and REP MOVSB
370 MVC, MVCL, CLC etc.
Even Z80's had instructions for moving data quickly, and probably the
68000 also has similar instructions.
Therefore, it is most probable that MOST of the computers in the world
(using C) are UNDERUTILIZED because the C language does not have support
for strings BUILTIN.
OTHER PROBLEMS WITH C STRINGS.
------------------------------
In addition to the EFFICENCY problem mentioned above, "C" has no
checking when strings are copied. It is all too easy to copy too much
data from one place to another, and overwrite area of storage
accidently.
PL/I, PASCAL and the IBM 370 Assembler automatically check if too much
data is being moved, and truncates a string copy if necessary.
Continued Message about a minor change to C
Included in this group of messages is a program that you can run to
demonstrate the relative times it takes to copy one string to another.
The following is the result of that small program.
It shows the relative times for various methods of copying strings.
__________________________________________________________________
| |
| Elapsed Time for DUMMY CPY (Loop Overhead) 2.00 |
| |
| Elapsed Time for ASM CPY is 7.00 |
| |
| Elapsed Time for MEMCPY CPY is 12.00 |
| |
| Elapsed Time for STRCPY CPY is 31.00 |
| |
| Elapsed Time for INCRCPY (Slowest) CPY is 136.00 |
| |
| Elapsed Time for INCRCPY (using Register Vars) CPY is 67.00 |
|__________________________________________________________________|
| |
| Type of Copy Elap Loop Real Ratio |
| Time Overhd Copy |
|__________________________________________________________________|
| |
| Copying a 66 byte string 300000 times |
| ===================================== |
| |
| INCRCPY (Static Pointers) 136.00 2.00 134.00 1.000000 |
| INCRCPY (Register Pointers) 67.00 2.00 65.00 2.061538 |
| STRCPY 31.00 2.00 29.00 4.620690 |
| MEMCPY 12.00 2.00 10.00 13.400000 |
| ASM 7.00 2.00 5.00 26.800000 |
| ^ |
| | |
| Note -------------------------------------------------| |
|__________________________________________________________________|
The next message mentions some C macros that can be used until C compilers
are made available with fast string handling builtin.
Continued Message about a minor change to C
Over many, many months, I have developed some C macros that will copy
varying length and fixed length strings, and compare them with each other.
There are two main versions of these routines:
* A generic set of macros that should work on any C compiler. These give
speed improvements of about 10 fold over the usual:
while (*dest++=*src++);
* A set of routines that should be usuable with any C compiler that has the
capability to generate ASM code and call in a macro assembler. These
should give a speed improvement of about 20 fold on an 8086 machine, and
probably similiar on a 370 type mainframe.
TWENTY FOLD !
===========
The C Macros and routines mentioned above:
* Speed copying strings by a factor or 4 to over 20 times.
* Speed comparing strings by a factor of 2 to over 18 times.
* Add a generic CPY function for copying one string to another (fixed
and varying length strings).
* Add a generic CMP function for comparing fixed and varying length
strings with each other, and optionally checking that the longer
strings have blanks on the end thus providing a true string
comparision.
* Provide an easier method of defining and using EXTERNAL variables.
* Automatically truncate long strings when they are copied to short
strings so that storage following the shorter strings is not
accidently overwritten.
* Automatically pads longer strings with blanks when a short string is
copied to a long fixed length string.
_____ ______
* Generic functions written totally in "C" macros can be used with any
"C" compiler to give approximately two times speed improvement.
In addition, special macros can be used with TURBOC and QUICKC to
provide a minimum of a three times improvement during the development
phase.
Finally, finely tuned Assembler macros can be used to produce really
fast small production programs outside the integrated environment
(improvements over twenty times can be gained).
* With a few minor modifications, the macros provided can be used with
IBM 370 style machines to enable the effective use of the CLC and MVC
and the "long" equivalent instructions.
------------------------------
Yesterday, I sent a group of messages about the speed of string handling
in C, and a program to demonstrate how much faster string handling could
be.
As part of the group of messages, I tried to send sample output from the
program, but due to the line lengths being too long, it didn't all get
there!
Here is a sample of the output from the program.
__________________________________________________________________
| |
| Elapsed Time for DUMMY CPY (Loop Overhead) 2.00 |
| |
| Elapsed Time for ASM CPY is 7.00 |
| |
| Elapsed Time for MEMCPY CPY is 12.00 |
| |
| Elapsed Time for STRCPY CPY is 31.00 |
| |
| Elapsed Time for INCRCPY (Slowest) CPY is 136.00 |
| |
| Elapsed Time for INCRCPY (using Register Vars) CPY is 67.00 |
|__________________________________________________________________|
| |
| Type of Copy Elap Loop Real Ratio |
| Time Overhd Copy |
|__________________________________________________________________|
| |
| Copying a 66 byte string 300000 times |
| ===================================== |
| |
| INCRCPY (Static Pointers) 136.00 2.00 134.00 1.000000 |
| INCRCPY (Register Pointers) 67.00 2.00 65.00 2.061538 |
| STRCPY 31.00 2.00 29.00 4.620690 |
| MEMCPY 12.00 2.00 10.00 13.400000 |
| ASM 7.00 2.00 5.00 26.800000 |
| ^ |
| | |
| Note -------------------------------------------------| |
|__________________________________________________________________|
The changes required to be made to C are simple, and will result
in much safer and faster programs.
Again, I'd like to hear your thoughts.
Regards
Clement Victor Clarke