Unsurprisingly, C has the notion of characters to represent text. Characters belong to an execution character set, which includes the following:

0 1 2 3 4 5 6 7 8 9
a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

…plus a space character, vertical tab, horizontal tab, form feed, alert, backspace, carriage return, new line, and a null character. These characters form the basic execution character set, but the full set may include other characters, depending on the LC_CTYPE category of the current locale, which can change during execution.

Text may enter or leave an executing program through streams, allowing it to be stored in and retrieved from files, read from stdin (which often permits interactive user input), or written to stdout and stderr (which often results in output on a console or terminal). Sequences of characters may be stored in the executing program's memory, and manipulated as strings, which are conventionally terminated by null characters.

Characters and strings are represented by sequences of (usually small) integers. The rules that govern how an integer sequence should be interpreted as a character sequence constitute an encoding. Four integer types of varying sizes are used to hold encoded characters: char, wchar_t, char16_t, and char32_t.

The null character is always encoded with the value zero, which never appears in multibyte or shift sequences. The digits 0 to 9 are always encoded such that subtracting one from another yields their numeric difference, e.g., subtracting 4 from 7 yields 3.
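The guarantee about digits is what makes the common c - '0' idiom portable. The following is a minimal sketch of that idiom; the helper names digit_value and parse_unsigned are hypothetical and not part of the standard library.

```c
/* Hypothetical helper: returns the numeric value of a decimal digit
   character, or -1 if c is not one of '0'..'9'. */
int digit_value(char c)
{
    if (c < '0' || c > '9')
        return -1;
    return c - '0';  /* valid because the digits are encoded contiguously */
}

/* Hypothetical helper: converts a run of leading decimal digits to a
   number, stopping at the first non-digit (including the null character). */
unsigned parse_unsigned(const char *s)
{
    unsigned n = 0;
    while (digit_value(*s) >= 0) {
        n = n * 10 + (unsigned)digit_value(*s);
        s++;
    }
    return n;
}
```

No such guarantee exists for letters, so 'b' - 'a' should not be assumed to be 1 in portable code.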

Strings are stored as arrays of these types. For example, the variable str is capable of storing 12 bytes of encoded characters:

char str[12];

Strings, being arrays, are passed to functions in the same way, as a pointer to the first element. Conventionally, the end of a string is marked by a null character. These example implementations of strlen and strcpy demonstrate how they search for it:

size_t strlen(const char *s)
{
  const char *orig = s; // const-qualified, to match the type of s
  while (*s)
    s++;
  return s - orig; // length in bytes, excluding the null terminator
}

char *strcpy(char *to, const char *from)
{
  char *orig = to;
  while (*from) {
    *to = *from;
    to++;
    from++;
  }
  *to = '\0'; // copy the null terminator as well
  return orig;
}
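Because the length reported by strlen excludes the null terminator, a destination buffer needs one extra byte. The sketch below uses the library versions from <string.h>; the helper name copy_string is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of s, or NULL if
   allocation fails. The caller is responsible for calling free(). */
char *copy_string(const char *s)
{
    size_t len = strlen(s);        /* bytes before the null terminator */
    char *copy = malloc(len + 1);  /* one extra byte for the terminator */
    if (copy == NULL)
        return NULL;
    strcpy(copy, s);               /* copies the terminator too */
    return copy;
}
```

A caller might write char *p = copy_string("I am a fish"); and later free(p);.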

A character can be embedded directly in the program source as a character constant. For example, to declare a variable initialized with the integer value for the character f:

char c = 'f';

The construct 'f' is a character constant, and it matches the grammar of character-constant. It actually has the type int, which of course gets converted to a char with the correct value. For wchar_t, use L'f'. For char16_t, use u'f'. For char32_t, use U'f'.
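The difference between the type of the constant and the type of the variable can be observed with C11's _Generic. This is a small sketch assuming a C11 (or later) compiler; the macro and function names are hypothetical.

```c
#include <stddef.h>  /* wchar_t */

/* _Generic picks a branch according to the type of its operand. */
#define IS_INT(x)  _Generic((x), int: 1, default: 0)
#define IS_CHAR(x) _Generic((x), char: 1, default: 0)

/* The unprefixed constant 'f' has type int in C. */
int constant_is_int(void)
{
    return IS_INT('f');
}

/* Once stored in a char, the value is unchanged but the type is char. */
int variable_is_char(void)
{
    char c = 'f';
    return IS_CHAR(c) && c == 'f';
}

/* A constant with the L prefix has type wchar_t. */
int wide_constant_matches_wchar_t(void)
{
    return sizeof L'f' == sizeof(wchar_t);
}
```

One practical consequence is that sizeof 'f' equals sizeof(int) in C, not 1.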

Almost every character from the basic character set can be expressed simply by surrounding it in single quotes. However, control characters, a literal single quote, and a literal backslash must be expressed using escape sequences:

single quote '\''
backslash '\\'
double quote '\"'
question mark '\?'
carriage return '\r'
vertical tab '\v'
horizontal tab '\t'
new line '\n'
alert '\a'
backspace '\b'
form feed '\f'
null '\0'

The form \? is usually unnecessary, and is only provided to prevent unintentional use of trigraphs. '\"' is also unnecessary, as '"' works fine in a character constant. The final form \0 is actually an octal-escape-sequence, representing the literal value zero, which is guaranteed to be the encoding for a null character. Other values of up to 9 bits can also be specified by following the backslash with up to three octal digits, e.g., \173 has the numeric value 1111011 in binary, or 123 in denary. An escape sequence of the form \x7b expresses the same numeric value in hexadecimal. An escape sequence of the form \u200b or \U000013ef (four or eight hexadecimal digits, respectively) identifies a Unicode character.
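The numeric escapes can be compared directly. The sketch below (function names are hypothetical) also shows a common pitfall: an octal escape consumes at most three digits, so a fourth digit becomes a separate character.

```c
#include <stddef.h>

/* Octal 173 and hexadecimal 7b both denote the value 123. */
int octal_value(void) { return '\173'; }
int hex_value(void)   { return '\x7b'; }

/* \0 is an octal escape with the value zero: the null character. */
int null_value(void)  { return '\0'; }

/* "\0173" is the escape \017, then the character '3', then the
   terminating null character: three bytes in total. */
size_t four_digit_literal_size(void) { return sizeof "\0173"; }
```

Hexadecimal escapes have no such digit limit, so a following hexadecimal digit is absorbed into the escape rather than treated as a new character.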

character-constant:
    ' c-char-sequence '
    L' c-char-sequence '
    u' c-char-sequence '
    U' c-char-sequence '

c-char-sequence:
    c-char
    c-char-sequence c-char

c-char:
    any member of the source character set except ', \ and the new-line character
    escape-sequence

escape-sequence:
    simple-escape-sequence
    octal-escape-sequence
    hexadecimal-escape-sequence
    universal-character-name

simple-escape-sequence: one of
    \'  \"  \?  \\  \a  \b  \f  \n  \r  \t  \v

octal-escape-sequence:
    \ octal-digit
    \ octal-digit octal-digit
    \ octal-digit octal-digit octal-digit

hexadecimal-escape-sequence:
    \x hexadecimal-digit
    hexadecimal-escape-sequence hexadecimal-digit

universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad

hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

A character array can be initialized with a string using a string literal. For example:

// explicit length, exactly enough room for the text and the null terminator
char s5[12] = "I am a fish";

// explicit length with no padding and no null terminator
char s6[11] = "I am a fish";

// padded with zeros
char s7[20] = "I am a fish";

// length computed from initializer
char s8[] = "I am a fish";

The construct "I am a fish" is a string literal, and matches the grammar of string-literal. As an array initializer, it is equivalent to:

{ 'I', ' ', 'a', 'm', ' ', 'a',
  ' ', 'f', 'i', 's', 'h', '\0' } 

…except that it is not an error if there is space in the array for everything except the terminating null. Use L"I am a fish" for an equivalent string of wchar_t. Use u"I am a fish" for an equivalent string of char16_t. Use U"I am a fish" for an equivalent string of char32_t. Use u8"I am a fish" for an equivalent string of char guaranteed to be encoded in UTF-8.
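The effect of each form of initializer can be checked with sizeof and a few comparisons. This is an illustrative sketch; the array and function names are hypothetical. Some compilers may warn about the array that has no room for the terminator, although the initialization is valid C.

```c
#include <stddef.h>
#include <string.h>

static char exact[12] = "I am a fish";        /* text plus terminator */
static char unterminated[11] = "I am a fish"; /* terminator is dropped */
static char padded[20] = "I am a fish";       /* remainder is zero-filled */
static char computed[] = "I am a fish";       /* length deduced as 12 */

/* Size the compiler chose for the array declared without a length. */
size_t computed_size(void) { return sizeof computed; }

/* Nonzero if every element after the 11 text bytes of padded[] is zero. */
int padding_is_zero(void)
{
    for (size_t i = 11; i < sizeof padded; i++)
        if (padded[i] != '\0')
            return 0;
    return 1;
}

/* unterminated[] is not a string, so compare it by explicit length. */
int unterminated_matches(void)
{
    return memcmp(unterminated, exact, sizeof unterminated) == 0;
}
```

An array without a terminator must not be passed to functions such as strlen or puts, which search for the null character.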

When a string literal is not used to initialize an array, it is treated as an anonymous static object. For example, this:

puts("I am a fish");

…behaves as if an anonymous array is declared and initialized, and then is used in place of the literal:

static char foo[] = "I am a fish";
puts(foo);

The example relies on an array expression decaying into a pointer to its first element when it is passed to a function. In an expression in which an array would not decay, such as sizeof "I am a fish", the string literal continues to be treated as an array, and the example expression yields 12, the size of the equivalent array.
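The distinction between the array and the pointer it decays to can be sketched as follows; the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* As the operand of sizeof, the literal is an array of 12 char. */
size_t literal_array_size(void)
{
    return sizeof "I am a fish";
}

/* As a function argument, the literal decays to a pointer; strlen
   then counts the 11 bytes before the null terminator. */
size_t literal_length(void)
{
    return strlen("I am a fish");
}

/* Once stored in a pointer, sizeof reports the size of the pointer,
   not the size of the array it points into. */
size_t pointer_size(void)
{
    const char *p = "I am a fish";
    return sizeof p;
}
```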

Within a string literal, the same escape sequences as for character constants can be used for special characters, except that \' is redundant, and \" is mandatory.
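A brief sketch of both cases, with hypothetical function names:

```c
#include <string.h>

/* \' is optional inside a string literal: both spellings produce
   the same sequence of characters. */
int apostrophe_escape_is_redundant(void)
{
    return strcmp("it's", "it\'s") == 0;
}

/* A double quote must be escaped, otherwise it would end the literal.
   The string here is: s a y space " f i s h " (10 characters). */
size_t quoted_length(void)
{
    return strlen("say \"fish\"");
}
```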

string-literal:
    encoding-prefix(opt) " s-char-sequence(opt) "

encoding-prefix:
    u8
    u
    U
    L

s-char-sequence:
    s-char
    s-char-sequence s-char

s-char:
    any member of the source character set except ", \ and the new-line character
    escape-sequence

The encoding for locale-dependent string literals and character constants is implementation-defined. It is selected by using the C locale in the category LC_CTYPE, i.e., setlocale(LC_CTYPE, "C"). This is the default, and must be restored before accessing any string literals or character constants containing characters outside of the basic set.
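Querying and restoring the LC_CTYPE category might look like the sketch below. The function names are hypothetical, and the check assumes that the implementation reports the C locale under the name "C", which is common but not required by the standard.

```c
#include <locale.h>
#include <string.h>

/* Nonzero if the current LC_CTYPE locale is reported as "C".
   Passing NULL queries the locale without changing it. */
int ctype_is_c_locale(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);
    return name != NULL && strcmp(name, "C") == 0;
}

/* Switches LC_CTYPE back to the C locale; nonzero on success. */
int restore_c_ctype(void)
{
    return setlocale(LC_CTYPE, "C") != NULL;
}
```

A program that changes LC_CTYPE for part of its work could call restore_c_ctype() before using literals that contain characters outside the basic set.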


Page updated 2022-06-17T21:43:05.000+0000