Original Message re C Strings

This was written in 1989 and sent to various newsgroups and people.

----------

I have been carefully watching the C Echo for about 6 months, and

noticed that many of the messages and mistakes seem to revolve around

STRING Handling.

I would like to see a very simple change made to C to solve many of the

problems, and at the same time, increase performance of C string

handling between 6 and over 20 fold. (Yes, 20 is not a typo!)

The changes to C are quite minor, and will not affect any programs

already written. In addition, they will save the extra coding necessary to

compare strings. For example, it would be wonderful to be able to code:

if (variable=='some literal') ...

instead of:

if (strcmp(variable,"some literal")==0) ...

The changes are:

1. Extend the syntax for single quotes to allow more than one

character. Thus statements such as:

if (name=='literal')

and

name='literal string'

become valid.

2. Extend the language syntax to allow three types of string. These

would be:

* Fixed length strings similar to PL/I or PASCAL.

PL/I fixed length strings differ from PASCAL Fixed length

strings in one way only - when PL/I compares a fixed length

string with another string, it effectively pads the shorter

string with blanks so that a true comparison takes place in an

IF statement.

PASCAL and C on the other hand, use the length of the strings

in a compare.

* Varying length strings. Again, I would suggest two main types

- the C and PASCAL type, and the PL/I type.

The same comments about comparisons apply to PL/I varying

strings - the strings are EFFECTIVELY padded with blanks when a

comparison is done.
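
The blank-padded comparison described for PL/I can be sketched in present-day C. This is only an illustration: the function name padded_cmp and its parameters are made up for this example, and it assumes both lengths are already known to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (the name is an assumption, not part of C):
 * compare two byte strings of known length as if the shorter one were
 * padded on the right with blanks, as described for PL/I.
 * Returns <0, 0 or >0 in the manner of memcmp. */
int padded_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t common = alen < blen ? alen : blen;
    size_t i;
    int r;

    assert(a != NULL && b != NULL);

    /* Compare the part that both strings really have. */
    r = memcmp(a, b, common);
    if (r != 0)
        return r;

    /* Compare the tail of the longer string against blanks. */
    for (i = common; i < alen; i++)
        if (a[i] != ' ')
            return (unsigned char)a[i] < ' ' ? -1 : 1;
    for (i = common; i < blen; i++)
        if (b[i] != ' ')
            return ' ' < (unsigned char)b[i] ? -1 : 1;
    return 0;
}
```

With this rule a 3 byte "abc" and a 6 byte "abc" followed by three blanks compare equal, whereas a comparison that also looks at the lengths would call them different.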

By simply adding the word 'var' or 'fixed' to a character definition,

optionally followed by the word 'pli', the compiler would be able to

know the difference between the string types.

Thus you could have the following definition:

char string-name char[50] var pli;

or

char name char[100] var; /* C or PASCAL style strings */
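
To make the idea concrete, here is a rough sketch of what a compiler might generate for a 'var' string in the C/PASCAL style. Everything in it is an assumption for illustration: the struct layout, the names var_string, var_assign and var_equals, and the choice of size_t for the length field are not part of the proposal itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VAR_CAPACITY 100  /* taken from the char[100] example above */

/* Hypothetical layout: the current length is held in front of the
 * data, as in PL/I or PASCAL, and no binary 0 is needed. */
struct var_string {
    size_t len;                 /* current length in bytes */
    char   text[VAR_CAPACITY];  /* the characters themselves */
};

/* Roughly what  name='literal string'  might compile to: one block
 * copy plus a length store, truncating if the source is too long. */
void var_assign(struct var_string *dst, const char *src, size_t srclen)
{
    assert(dst != NULL && src != NULL);
    if (srclen > VAR_CAPACITY)
        srclen = VAR_CAPACITY;
    memcpy(dst->text, src, srclen);
    dst->len = srclen;
}

/* Roughly what  if (name=='literal')  might compile to: the lengths
 * must match, then a single block compare is enough. */
int var_equals(const struct var_string *s, const char *lit, size_t litlen)
{
    assert(s != NULL && lit != NULL);
    return s->len == litlen && memcmp(s->text, lit, litlen) == 0;
}
```

For a literal, the compiler knows the length at compile time, so no search for a terminator is needed at run time.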

Next message contains compiler changes required.

Continued Message about a minor change to C

WHY IS C STRING HANDLING SLOWER THAN IT CAN BE?

-----------------------------------------------

The "C" String Problem

======================

Unlike PL/I, PASCAL, BASIC etc, "C" does not hold the length of a

varying string at the front of the string - rather, a string is

terminated by a binary 0 at the end of the string. Performance suffers

greatly because a search must be made for the binary 0 before a string

can be copied or compared to or with another string.

Because C uses a Binary 0 at the end of each string instead of keeping

the current length of VARYING strings at the front of the string, when

you copy a string to another string (or compare one string with

another), C must either find out how much data to copy or compare, or it

must look for the binary 0 AS IT IS COPYING.
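
The difference can be shown side by side. The sketch below is an illustration only; the names copy_terminated and copy_counted are invented for this example. The first routine has to test every byte for the binary 0 as it copies, while the second is told the length and can hand the whole block to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 0-terminated copy: each byte is tested for the terminator while it
 * is being copied. Returns the number of bytes copied, not counting
 * the terminator. The caller must make sure dst is large enough. */
size_t copy_terminated(char *dst, const char *src)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL);
    while ((dst[n] = src[n]) != '\0')
        n++;
    return n;
}

/* Counted copy: the length is already known, so no per-byte test is
 * needed and the library or compiler is free to use a block move. */
size_t copy_counted(char *dst, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
    return len;
}
```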

In addition, "C" string handling is usually performed with expensive

string functions - and functions with parameters are expensive to call

and error prone (for example, there is nothing in "C" to stop a long

string being copied to a short string, and destroying data that follows

the short string).

The real problem is that "C" does not have any inbuilt string

handling facilities, and this is probably because the original machine "C"

was designed for (the PDP-11) didn't have them either.

Many machines such as the IBM PC with 8086 chips or equivalent, and IBM

mainframes like the 370 series of computers have instructions (or groups

of instructions) for moving and comparing data quickly. For example:

8086 REP MOVSW and REP MOVSB

370 MVC, MVCL, CLC etc.

Even the Z80 had instructions for moving data quickly, and probably the

68000 also has similar instructions.

Therefore, it is most probable that MOST of the computers in the world

(using C) are UNDERUTILIZED because the C language does not have support

for strings BUILT IN.

OTHER PROBLEMS WITH C STRINGS.

------------------------------

In addition to the EFFICIENCY problem mentioned above, "C" has no

checking when strings are copied. It is all too easy to copy too much

data from one place to another, and overwrite an area of storage

accidentally.

PL/I, PASCAL and the IBM 370 Assembler automatically check if too much

data is being moved, and truncate a string copy if necessary.
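
A checked copy of this kind can be written in C today if the size of the destination is passed along. The following is a minimal sketch; the name bounded_copy is hypothetical and it assumes the destination holds at least one byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical checked copy for 0-terminated strings: copies at most
 * dstsize - 1 bytes, always terminates the result, and never writes
 * past dst[dstsize - 1]. Returns the number of bytes copied, not
 * counting the terminator. */
size_t bounded_copy(char *dst, size_t dstsize, const char *src)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL && dstsize > 0);
    while (n + 1 < dstsize && src[n] != '\0') {
        dst[n] = src[n];
        n++;
    }
    dst[n] = '\0';
    return n;
}
```

Copying "overflow" into a 4 byte destination this way stores "ove" and the terminator, and the storage that follows is left alone.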

Included in this group of messages is a program that you can run to

demonstrate the relative times it takes to copy one string to another.

The following is the result of that small program.

It shows the relative times for various methods of copying strings.

 __________________________________________________________________
|                                                                  |
| Elapsed Time for DUMMY CPY (Loop Overhead)                  2.00 |
| Elapsed Time for ASM CPY is                                 7.00 |
| Elapsed Time for MEMCPY CPY is                             12.00 |
| Elapsed Time for STRCPY CPY is                             31.00 |
| Elapsed Time for INCRCPY (Slowest) CPY is                 136.00 |
| Elapsed Time for INCRCPY (using Register Vars) CPY is      67.00 |
|__________________________________________________________________|
|                                                                  |
| Type of Copy                    Elap    Loop     Real      Ratio |
|                                 Time  Overhd     Copy            |
|__________________________________________________________________|
|                                                                  |
| Copying a 66 byte string 300000 times                            |
| =====================================                            |
|                                                                  |
| INCRCPY (Static Pointers)     136.00    2.00   134.00   1.000000 |
| INCRCPY (Register Pointers)    67.00    2.00    65.00   2.061538 |
| STRCPY                         31.00    2.00    29.00   4.620690 |
| MEMCPY                         12.00    2.00    10.00  13.400000 |
| ASM                             7.00    2.00     5.00  26.800000 |
|                                                            ^     |
|                                                            |     |
| Note ------------------------------------------------------|     |
|__________________________________________________________________|
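
The Ratio column can be reproduced from the other columns: the loop overhead is subtracted from each elapsed time to give the real copy time, and the real copy time of the slowest method is divided by the real copy time of the method in question. A small sketch of that arithmetic, with a function name (speed_ratio) invented for this example:

```c
#include <assert.h>

/* Ratio as used in the table:
 *   (slowest elapsed - overhead) / (elapsed - overhead)
 * The name speed_ratio is an assumption made for this illustration. */
double speed_ratio(double slowest_elapsed, double elapsed, double overhead)
{
    double real_copy = elapsed - overhead;

    assert(real_copy > 0.0);
    return (slowest_elapsed - overhead) / real_copy;
}
```

For the ASM row this gives (136.00 - 2.00) / (7.00 - 2.00) = 134.00 / 5.00 = 26.8, which matches the table.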

The next message mentions some C macros that can be used until C compilers

are made available with fast string handling built in.

Over many, many months, I have developed some C macros that will copy

varying length and fixed length strings, and compare them with each other.

There are two main versions of these routines:

* A generic set of macros that should work on any C compiler. These give

speed improvements of about 10 fold over the usual:

while (*dest++=*src++);

* A set of routines that should be usable with any C compiler that has the

capability to generate ASM code and call in a macro assembler. These

should give a speed improvement of about 20 fold on an 8086 machine, and

probably similar on a 370 type mainframe.
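
Anyone wishing to check figures like these on their own compiler could use a small timing harness along the following lines. This is a sketch only and is not the program from these messages: the names copy_fn, incr_copy, mem_copy and time_copies are assumptions, and a modern optimizing compiler may shorten or remove the loop, so results need care.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature so that different copy methods can be timed alike. */
typedef void (*copy_fn)(char *dst, const char *src, size_t len);

/* The usual incremental copy; the length argument is not needed. */
void incr_copy(char *dst, const char *src, size_t len)
{
    (void)len;
    while ((*dst++ = *src++) != '\0')
        ;
}

/* Block copy using the known length (plus the terminating 0). */
void mem_copy(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len + 1);
}

/* Copy src into a scratch buffer reps times and return the processor
 * time used, in seconds, as reported by clock(). */
double time_copies(copy_fn fn, const char *src, long reps)
{
    static char dst[128];
    size_t len = strlen(src);
    clock_t start, stop;
    long i;

    assert(fn != NULL && len < sizeof dst);
    start = clock();
    for (i = 0; i < reps; i++)
        fn(dst, src, len);
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_copies(incr_copy, text, 300000L) and time_copies(mem_copy, text, 300000L) with a 66 byte text mirrors the run shown in the table; on current hardware a much larger repeat count is likely to be needed before the times become measurable.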

TWENTY FOLD !

===========

The C Macros and routines mentioned above:

* Speed copying strings by a factor of 4 to over 20 times.

* Speed comparing strings by a factor of 2 to over 18 times.

* Add a generic CPY function for copying one string to another (fixed

and varying length strings).

* Add a generic CMP function for comparing fixed and varying length

strings with each other, and optionally checking that the longer

strings have blanks on the end thus providing a true string

comparison.

* Provide an easier method of defining and using EXTERNAL variables.

* Automatically truncate long strings when they are copied to short

strings so that storage following the shorter strings is not

accidentally overwritten.

* Automatically pad longer strings with blanks when a short string is

copied to a long fixed length string.

* Generic functions written totally in "C" macros can be used with any

"C" compiler to give approximately two times speed improvement.

In addition, special macros can be used with TURBOC and QUICKC to

provide a minimum of a three times improvement during the development

phase.

Finally, finely tuned Assembler macros can be used to produce really

fast small production programs outside the integrated environment

(improvements over twenty times can be gained).

* With a few minor modifications, the macros provided can be used with

IBM 370 style machines to enable the effective use of the CLC and MVC

and the "long" equivalent instructions.
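
The truncating and padding behaviour listed above can be illustrated with a short sketch. It is not the set of macros referred to in this message; the names fixed_copy and CPY_FIXED are invented here, and the macro assumes its first argument is a real array (so that sizeof gives its length) and its second is a string literal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical copy into a fixed length string: truncates when the
 * source is longer than the destination and pads with blanks when it
 * is shorter. No binary 0 is stored. */
void fixed_copy(char *dst, size_t dstlen, const char *src, size_t srclen)
{
    size_t n = srclen < dstlen ? srclen : dstlen;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);               /* never more than dstlen bytes */
    memset(dst + n, ' ', dstlen - n);  /* blank-fill the remainder */
}

/* Both lengths come from sizeof, so they are known at compile time
 * and no search for a terminator takes place at run time. */
#define CPY_FIXED(dst, lit) \
    fixed_copy((dst), sizeof (dst), (lit), sizeof (lit) - 1)
```

With char f[8], CPY_FIXED(f, "abc") leaves "abc" followed by five blanks in f, and CPY_FIXED(f, "abcdefghijkl") leaves "abcdefgh" without touching the storage after f.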

------------------------------

Yesterday, I sent a group of messages about the speed of string handling

in C, and a program to demonstrate how much faster string handling could

be.

As part of the group of messages, I tried to send sample output from the

program, but due to the line lengths being too long, it didn't all get

there!

Here is a sample of the output from the program.

 __________________________________________________________________
|                                                                  |
| Elapsed Time for DUMMY CPY (Loop Overhead)                  2.00 |
| Elapsed Time for ASM CPY is                                 7.00 |
| Elapsed Time for MEMCPY CPY is                             12.00 |
| Elapsed Time for STRCPY CPY is                             31.00 |
| Elapsed Time for INCRCPY (Slowest) CPY is                 136.00 |
| Elapsed Time for INCRCPY (using Register Vars) CPY is      67.00 |
|__________________________________________________________________|
|                                                                  |
| Type of Copy                    Elap    Loop     Real      Ratio |
|                                 Time  Overhd     Copy            |
|__________________________________________________________________|
|                                                                  |
| Copying a 66 byte string 300000 times                            |
| =====================================                            |
|                                                                  |
| INCRCPY (Static Pointers)     136.00    2.00   134.00   1.000000 |
| INCRCPY (Register Pointers)    67.00    2.00    65.00   2.061538 |
| STRCPY                         31.00    2.00    29.00   4.620690 |
| MEMCPY                         12.00    2.00    10.00  13.400000 |
| ASM                             7.00    2.00     5.00  26.800000 |
|                                                            ^     |
|                                                            |     |
| Note ------------------------------------------------------|     |
|__________________________________________________________________|

The changes required to be made to C are simple, and will result

in much safer and faster programs.

Again, I'd like to hear your thoughts.

Regards

Clement Victor Clarke