Text — CHaR

Unsurprisingly, C has the notion of characters to represent text. Characters belong to an execution character set, which includes the following:

0 1 2 3 4 5 6 7 8 9
a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

…plus a space character, vertical tab, horizontal tab, form feed, alert, backspace, carriage return, new line, and a null character. These characters form the basic execution character set, but the full set may include other characters, depending on the LC_CTYPE category of the current locale, which can change during execution.

Text may enter or leave an executing program through streams, allowing it to be stored in and retrieved from files, read from stdin (which often permits interactive user input), or written to stdout and stderr (which often results in output on a console or terminal). Sequences of characters may be stored in the executing program's memory, and manipulated as strings, which are conventionally terminated by null characters.

Characters and strings are represented by sequences of (usually small) integers. The rules that govern how an integer sequence should be interpreted as a character sequence constitute an encoding. Four integer types of varying sizes are used to hold encoded characters:

char will usually be the most compact type, and uses a locale-dependent multibyte encoding, or UTF-8.
wchar_t is usually much wider, allowing it to encode a large set of characters using only one wchar_t per character, and without shift states. However, it is still locale-dependent.
char16_t is capable of holding characters encoded according to UTF-16, so it has at least 16 bits.
char32_t is capable of holding raw UTF-32 (UCS-4) characters, so it has at least 32 bits.
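As a rough sketch, the minimum widths of these types can be checked at compile time and reported at run time. The function names below (bits_char and friends) are made up for this illustration, and the code assumes a C11 implementation that provides <uchar.h>.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <uchar.h>

/* Compile-time checks of the guaranteed minimum widths. */
static_assert(sizeof(char) == 1, "char is one byte by definition");
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds a UTF-32 code unit");

/* Width in bits of each character type on this implementation
   (hypothetical helper names). */
size_t bits_char(void)   { return sizeof(char) * CHAR_BIT; }
size_t bits_wchar(void)  { return sizeof(wchar_t) * CHAR_BIT; }
size_t bits_char16(void) { return sizeof(char16_t) * CHAR_BIT; }
size_t bits_char32(void) { return sizeof(char32_t) * CHAR_BIT; }
```

The exact widths vary between implementations; only the lower bounds are portable.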

The null character is always encoded with the value zero, which never appears in multibyte or shift sequences. The digits 0 to 9 are always encoded such that subtracting one from another yields their numeric difference, e.g., subtracting 4 from 7 yields 3.
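The digit guarantee is what makes the familiar c - '0' idiom portable. A minimal sketch, using the hypothetical helper names digit_value and parse_decimal (overflow is not handled):

```c
#include <assert.h>

/* Convert a decimal digit character to its numeric value.
   Relies only on the guarantee that '0'..'9' are encoded contiguously. */
int digit_value(char c)
{
  assert(c >= '0' && c <= '9');
  return c - '0';
}

/* Parse leading decimal digits of s; stops at the first non-digit.
   No overflow checking: this is an illustration, not a strtoul replacement. */
unsigned parse_decimal(const char *s)
{
  unsigned n = 0;
  while (*s >= '0' && *s <= '9') {
    n = n * 10 + (unsigned)(*s - '0');
    s++;
  }
  return n;
}
```

No similar guarantee exists for letters, so 'b' - 'a' is not portably 1.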

Strings are stored as arrays of these types. For example, the variable str is capable of storing 12 bytes of encoded characters, including any null terminator:

char str[12];

Strings, being arrays, are passed to functions in the same way, as a pointer to the first element. Conventionally, the end of a string is marked by a null character. These example implementations of strlen and strcpy demonstrate how they search for it:

size_t strlen(const char *s)
{
  const char *orig = s; // keep const: the string is only read
  while (*s)
    s++;
  return (size_t)(s - orig); // length in bytes, excluding the null terminator
}

char *strcpy(char *to, const char *from)
{
  char *orig = to;
  while (*from) {
    *to = *from;
    to++;
    from++;
  }
  *to = '\0'; // copy the null terminator too
  return orig;
}
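Because the terminator occupies an element of its own, a destination array must have room for the text plus one. The following sketch checks this before copying; copy_if_fits is a hypothetical helper, not a standard function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `from` into `to` only if the text plus its null terminator
   fits within `cap` bytes. Returns 1 on success, 0 if it is too long. */
int copy_if_fits(char *to, size_t cap, const char *from)
{
  assert(to != NULL && from != NULL);
  size_t len = strlen(from);   /* bytes before the terminator */
  if (len + 1 > cap)
    return 0;
  memcpy(to, from, len + 1);   /* the + 1 copies the terminator */
  return 1;
}
```

With char str[12], the 11-byte text "I am a fish" fits, while the 12-byte text "I am a whale" does not, as it would need 13 bytes.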

A character can be embedded directly in the program source as a character constant. For example, to declare a variable initialized with the integer value for the character f:

char c = 'f';

The construct 'f' is a character constant, and it matches the grammar of character-constant. It actually has the type int, which of course gets converted to a char with the correct value. For wchar_t, use L'f'. For char16_t, use u'f'. For char32_t, use U'f'.
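The type of a character constant can be observed with sizeof and C11's _Generic. This is a small sketch; the macro names IS_INT and IS_CHAR are invented for the illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Yield 1 if the expression has the named type, otherwise 0. */
#define IS_INT(x)  _Generic((x), int: 1, default: 0)
#define IS_CHAR(x) _Generic((x), char: 1, default: 0)

/* A plain character constant has type int in C (unlike C++). */
static_assert(sizeof 'f' == sizeof(int), "plain constants have type int");

/* A wide character constant has type wchar_t. */
static_assert(sizeof L'f' == sizeof(wchar_t), "L'f' has type wchar_t");

/* The u and U forms are at least 16 and 32 bits wide respectively. */
static_assert(sizeof u'f' * CHAR_BIT >= 16, "u'f' is at least 16 bits");
static_assert(sizeof U'f' * CHAR_BIT >= 32, "U'f' is at least 32 bits");
```

IS_INT('f') yields 1, whereas for an object declared as char c = 'f'; it is IS_CHAR(c) that yields 1, since the conversion happens on initialization.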

Almost every character from the basic character set can be expressed simply by surrounding it in single quotes. However, control characters, a literal single quote, and a backslash must be expressed using escape sequences:

single quote	`'\''`
backslash	`'\\'`
double quote	`'\"'`
question mark	`'\?'`
carriage return	`'\r'`
vertical tab	`'\v'`
horizontal tab	`'\t'`
new line	`'\n'`
alert	`'\a'`
backspace	`'\b'`
form feed	`'\f'`
null	`'\0'`

The form \? is usually unnecessary, and is only provided to prevent unintentional use of trigraphs. '\"' is also unnecessary, as '"' works fine in a character constant. The final form \0 is actually an octal-escape-sequence, representing the literal value zero, which is guaranteed to be the encoding for a null character. Other values of up to 9 bits (3 octal digits) can also be specified by following the backslash with one to three octal digits, e.g., \173 has the numeric value 1111011 in binary, or 123 in decimal. At most three digits are consumed, so \0173 is the value 017 followed by the character 3. An escape sequence of the form \x7b expresses the same numeric value in hexadecimal. An escape sequence of the form \u200b (exactly four hexadecimal digits) or \U000013ef (exactly eight) identifies a Unicode character.

character-constant: ' c-char-sequence '; L' c-char-sequence '; u' c-char-sequence '; U' c-char-sequence '
c-char-sequence: c-char; c-char-sequence c-char
c-char: any member of the source character set except ', \ and the new-line character; escape-sequence
escape-sequence: simple-escape-sequence; octal-escape-sequence; hexadecimal-escape-sequence; universal-character-name
simple-escape-sequence: \'; \"; \?; \\; \a; \b; \f; \n; \r; \t; \v
octal-escape-sequence: \ octal-digit; \ octal-digit octal-digit; \ octal-digit octal-digit octal-digit
hexadecimal-escape-sequence: \x hexadecimal-digit; hexadecimal-escape-sequence hexadecimal-digit
universal-character-name: \u hex-quad; \U hex-quad hex-quad
hex-quad: hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
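The numeric escapes in this grammar can be checked directly. This sketch shows that an octal escape takes at most three digits, using sizeof on string literals to count the resulting characters; the function names are hypothetical.

```c
#include <assert.h>

/* Octal 173 and hexadecimal 7b both denote the value 123. */
static_assert('\173' == 123, "octal escape");
static_assert('\x7b' == 123, "hexadecimal escape");
static_assert('\0' == 0, "the null character is an octal escape too");

/* An octal escape consumes at most three digits. */
static_assert(sizeof "\173" == 2, "one character plus the terminator");
static_assert(sizeof "\0173" == 3, "value 017, then '3', then the terminator");

/* Run-time access to the same values (hypothetical helper names). */
int octal_value(void) { return '\173'; }
int hex_value(void)   { return '\x7b'; }
```

A hexadecimal escape, by contrast, has no length limit and consumes every hexadecimal digit that follows it.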

A character array can be initialized with a string using a string literal. For example:

// explicit length with no padding
char s5[12] = "I am a fish";

// explicit length with no padding and no null terminator
char s6[11] = "I am a fish";

// padded with zeros
char s7[20] = "I am a fish";

// length computed from initializer
char s8[] = "I am a fish";

The construct "I am a fish" is a string literal, and matches the grammar of string-literal. As an array initializer, it is equivalent to:

{ 'I', ' ', 'a', 'm', ' ', 'a',
  ' ', 'f', 'i', 's', 'h', '\0' }

…except that it is not an error if there is space in the array for everything except the terminating null. Use L"I am a fish" for an equivalent string of wchar_t. Use u"I am a fish" for an equivalent string of char16_t. Use U"I am a fish" for an equivalent string of char32_t. Use u8"I am a fish" for an equivalent string of char guaranteed to be encoded in UTF-8.
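The effect of each array length can be checked as follows. The array names and the bounded_length helper are made up for this sketch; the helper avoids running off the end of an array that has no terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

char fits[12]   = "I am a fish";  /* exactly enough room, terminator included */
char no_nul[11] = "I am a fish";  /* legal in C, but not a terminated string */
char padded[20] = "I am a fish";  /* remaining elements are zero */
char computed[] = "I am a fish";  /* length taken from the initializer */

static_assert(sizeof computed == 12, "11 characters plus the terminator");

/* Length of the text in buf, looking at no more than cap bytes.
   Returns cap if no terminator is found within the array. */
size_t bounded_length(const char *buf, size_t cap)
{
  const char *end = memchr(buf, '\0', cap);
  return end ? (size_t)(end - buf) : cap;
}
```

Passing no_nul to strlen or puts would read past the end of the array, so an array initialized this way should only be used with an explicit length.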

When a string literal is not used to initialize an array, it is treated as an anonymous static object. For example, this:

puts("I am a fish");

…behaves as if an anonymous array is declared and initialized, and then is used in place of the literal:

static char foo[] = "I am a fish";
puts(foo);

The example relies on an array decaying into a pointer to its first element when it is passed to a function. In an expression in which an array would not decay, such as sizeof "I am a fish", the string literal continues to be treated as an array, and the example expression yields 12, the size of the equivalent array.
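The difference between the array and the pointer it decays to shows up clearly with sizeof. A brief sketch, with hypothetical function names:

```c
#include <assert.h>
#include <stddef.h>

/* As the operand of sizeof, the literal is still an array of 12 char. */
static_assert(sizeof "I am a fish" == 12, "11 characters plus the terminator");

/* Size of the literal itself, measured as an array. */
size_t literal_size(void)
{
  return sizeof "I am a fish";
}

/* Size of a pointer initialized from the literal: the array has decayed,
   so this is the size of a pointer, whatever the length of the text. */
size_t decayed_size(void)
{
  const char *p = "I am a fish";
  return sizeof p;
}
```

Here literal_size() is always 12, while decayed_size() depends on the platform's pointer width.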

Within a string literal, the same escape sequences as for character constants can be used for special characters, except that \' is redundant, and \" is mandatory.

string-literal: encoding-prefix_opt " s-char-sequence_opt "
encoding-prefix: u8; u; U; L
s-char-sequence: s-char; s-char-sequence s-char
s-char: any member of the source character set except ", \ and the new-line character; escape-sequence
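The quoting rules for string literals can be illustrated with a few arrays. The names are chosen only for this sketch.

```c
#include <assert.h>
#include <string.h>

/* A double quote inside a string literal must be escaped. */
const char quoted[] = "say \"fish\"";

/* A single quote may be written directly, or escaped with the same result. */
const char plain_apostrophe[]   = "it's";
const char escaped_apostrophe[] = "it\'s";

/* Ten characters (including both quotes) plus the terminator. */
static_assert(sizeof quoted == 11, "escapes count as one character each");
static_assert(sizeof plain_apostrophe == sizeof escaped_apostrophe,
              "\\' and ' produce the same character");
```

Both apostrophe forms compare equal with strcmp, confirming that \' is redundant in a string literal.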

The encoding for locale-dependent string literals and character constants is implementation-defined. It is selected by using the C locale in the category LC_CTYPE, i.e., setlocale(LC_CTYPE, "C"). This is the default, and must be restored before accessing any string literals or character constants containing characters outside of the basic set.
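Querying and restoring the LC_CTYPE category might look like the sketch below. The helper names in_c_locale and restore_c_locale are hypothetical; only setlocale itself is a standard function.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Return 1 if the LC_CTYPE category is currently the "C" locale.
   Passing NULL to setlocale queries the setting without changing it. */
int in_c_locale(void)
{
  const char *current = setlocale(LC_CTYPE, NULL);
  return current != NULL && strcmp(current, "C") == 0;
}

/* Switch LC_CTYPE back to the "C" locale, which is always available. */
void restore_c_locale(void)
{
  const char *result = setlocale(LC_CTYPE, "C");
  assert(result != NULL);
  (void)result; /* silences the unused warning when NDEBUG is defined */
}
```

At program startup, in_c_locale() returns 1, since the "C" locale is in effect until the program calls setlocale to change it.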