Unsurprisingly, C has the notion of characters to represent text. Characters belong to an execution character set, which includes the following:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
…plus a space character, vertical tab, horizontal tab,
form feed, alert, backspace, carriage return, new line, and a
null character. These characters form the basic
execution character set, but the full set may include
other characters, depending on the LC_
category of the
current locale, which
can change during execution.
Text may enter or leave an executing program through
streams,
allowing it to be stored in and retrieved from files, read from
stdin
(which often permits interactive
user input), or written to stdout
and stderr
(which often results in output
on a console or terminal). Sequences of characters may be
stored in the executing program's memory, and manipulated as
strings, which are conventionally terminated by
null characters.
Characters and strings are represented by sequences of (usually small) integers. The rules that govern how an integer sequence should be interpreted as a character sequence constitute an encoding. Four integer types of varying sizes are used to hold encoded characters:
- char will usually be the most compact type, and uses a locale-dependent multibyte encoding, or UTF-8.
- wchar_t is usually much wider, allowing it to encode a large set of characters using only one wchar_t per character, and without shift states. However, its encoding is still locale-dependent.
- char16_t is capable of holding UTF-16 code units, so it has at least 16 bits.
- char32_t is capable of holding UTF-32 (UCS-4) code points, so it has at least 32 bits.
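As a rough illustration of the four types, the following sketch computes their widths on the current implementation. The BITS_* names are made up for this example, and it assumes a C11 implementation that provides <uchar.h>.

```c
#include <assert.h>  /* static_assert (C11) */
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* wchar_t */
#include <uchar.h>   /* char16_t, char32_t (C11) */

/* Width in bits of each character type on this implementation.
   The BITS_* names are illustrative, not standard. */
enum {
    BITS_CHAR   = CHAR_BIT * sizeof(char),
    BITS_WCHAR  = CHAR_BIT * sizeof(wchar_t),
    BITS_CHAR16 = CHAR_BIT * sizeof(char16_t),
    BITS_CHAR32 = CHAR_BIT * sizeof(char32_t)
};

/* Checked at compile time; these should hold on a conforming C11 compiler. */
static_assert(BITS_CHAR16 >= 16, "char16_t can hold a UTF-16 code unit");
static_assert(BITS_CHAR32 >= 32, "char32_t can hold a UTF-32 code unit");
```

The width of wchar_t is deliberately left unchecked, since it varies between platforms.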
The null character is always encoded with the value zero,
which never appears in multibyte or shift sequences. The
digits 0
to 9
are always encoded such that subtracting one from another
yields their numeric difference, e.g., subtracting 4
from 7
yields 3.
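The digit guarantee is what makes the usual character-to-number conversion portable. A minimal sketch, in which digit_value is a hypothetical helper rather than a library function:

```c
#include <assert.h>  /* static_assert (C11) */

/* The digits are contiguous and ascending in every execution character set. */
static_assert('7' - '4' == 3, "digit constants differ by their numeric difference");

/* Hypothetical helper: numeric value of a decimal digit character,
   or -1 if c is not one of '0'..'9'. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}
```

For example, digit_value('7') yields 7. No similar guarantee exists for letters, so 'c' - 'a' is not portable in the same way.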
Strings are stored as arrays of these types. For example, the
variable str
is
capable of storing 12 bytes of encoded characters:
char str[12];
Strings, being arrays, are passed to functions in the same way, as a
pointer to
the first element. Conventionally, the end of a string is
marked by a null character. These example implementations of
strlen
and strcpy
demonstrate how they search for
it:
size_t strlen(const char *s)
{
    const char *orig = s;
    while (*s)
        s++;
    return s - orig; // length in bytes, excluding the null terminator
}

char *strcpy(char *to, const char *from)
{
    char *orig = to;
    while (*from) {
        *to = *from;
        to++;
        from++;
    }
    *to = '\0'; // copy the null terminator as well
    return orig;
}
A character can be embedded directly in the program
source
as a character constant. For example, to declare a
variable initially with the integer value for the character
f
:
char c = 'f';
The construct 'f'
is a character
constant, and it matches the grammar of character-constant. It actually has the
type int
, which of
course gets converted to a char
with the
correct value. For wchar_t
, use L'f'
. For char16_t
, use u'f'
. For char32_t
, use U'f'
.
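The types of these constants can be checked at compile time. The sketch below covers only the unprefixed and L-prefixed forms; roundtrips_through_char is a hypothetical helper added for illustration.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* wchar_t */

/* In C (unlike C++), an unprefixed character constant has type int. */
static_assert(sizeof 'f' == sizeof(int), "plain character constant is an int");

/* An L-prefixed character constant has type wchar_t. */
static_assert(sizeof L'f' == sizeof(wchar_t), "wide character constant is a wchar_t");

/* Hypothetical helper: the int value of 'f' survives conversion to char. */
int roundtrips_through_char(void)
{
    char c = 'f';
    return c == 'f';
}
```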
Almost every character from the basic character set can be expressed simply by surrounding it in single quotes. However, control characters, a literal single quote, and a backslash must be expressed using escape sequences:
single quote | '\'' |
backslash | '\\' |
double quote | '\"' |
question mark | '\?' |
carriage return | '\r' |
vertical tab | '\v' |
horizontal tab | '\t' |
new line | '\n' |
alert | '\a' |
backspace | '\b' |
form feed | '\f' |
null | '\0' |
The form \?
is usually unnecessary,
and is only provided to prevent unintentional use of
trigraphs.
'\"'
is also unnecessary, as
'"'
works fine in a character
constant. The final form \0 is actually an octal-escape-sequence, representing the literal value zero, which is guaranteed to be the encoding for a null character. Other values of up to 9 bits (3 octal digits) can also be specified directly after the backslash, e.g., \173 has the numeric value 1111011 in binary, or 123 in denary. An escape sequence of the form \x7b expresses the same numeric value in hexadecimal. An escape sequence of the form \u200b (exactly four hexadecimal digits) or \U000013ef (exactly eight hexadecimal digits) identifies a Unicode character.
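The numeric escapes can be compared directly, because character constants are ordinary integer constant expressions. A small sketch, with is_null_char as a hypothetical helper:

```c
#include <assert.h>  /* static_assert (C11) */

/* Octal 173 and hexadecimal 7b both denote the value 123. */
static_assert('\173' == 123, "octal escape");
static_assert('\x7b' == 123, "hexadecimal escape");

/* The null character is always encoded as zero. */
static_assert('\0' == 0, "null character");

/* Hypothetical helper: nonzero if c is the null character. */
int is_null_char(char c)
{
    return c == '\0';
}
```

Which character the value 123 represents depends on the execution character set, so the sketch only compares numeric values.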
- character-constant
' c-char-sequence '
L' c-char-sequence '
u' c-char-sequence '
U' c-char-sequence '
- c-char-sequence
c-char
c-char-sequence c-char
- c-char
any member of the source character set except the single quote ', backslash \, or new-line character
escape-sequence
- escape-sequence
simple-escape-sequence
octal-escape-sequence
hexadecimal-escape-sequence
universal-character-name
- simple-escape-sequence
\'
\"
\?
\\
\a
\b
\f
\n
\r
\t
\v
- octal-escape-sequence
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
- hexadecimal-escape-sequence
\x hexadecimal-digit
hexadecimal-escape-sequence hexadecimal-digit
- universal-character-name
\u hex-quad
\U hex-quad hex-quad
- hex-quad
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
A character array can be initialized with a string using a string literal. For example:
// explicit length with no padding
char s5[12] = "I am a fish";
// explicit length with no padding and no null terminator
char s6[11] = "I am a fish";
// padded with zeros
char s7[20] = "I am a fish";
// length computed from initializer
char s8[] = "I am a fish";
The construct "I am a fish"
is a
string literal, and matches the grammar of string-literal. As an array
initializer, it is equivalent to:
{ 'I', ' ', 'a', 'm', ' ', 'a', ' ', 'f', 'i', 's', 'h', '\0' }
…except that it is not an error if there is space in the
array for everything except the terminating null. Use
L"I am a fish"
for an equivalent
string of wchar_t
. Use u"I am a fish"
for an equivalent string of
char16_t
. Use U"I am a fish"
for an equivalent string of
char32_t
. Use u8"I am a fish"
for an equivalent string of
char
guaranteed
to be encoded in UTF-8.
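The effect of the array length on the stored bytes can be observed by counting the zero bytes at the end of each array. This is only a sketch: the array names and the trailing_zeros helper are invented for the example, and some compilers may warn about the 11-element initializer even though it is valid C.

```c
#include <assert.h>  /* static_assert (C11) */
#include <string.h>  /* size_t, strlen */

char exact[12]        = "I am a fish"; /* 11 characters + null terminator */
char unterminated[11] = "I am a fish"; /* no room for the terminator */
char padded[20]       = "I am a fish"; /* remaining elements are zero */
char computed[]       = "I am a fish"; /* length taken from the literal */

static_assert(sizeof computed == 12, "11 characters plus the terminating null");

/* Hypothetical helper: number of zero bytes at the end of an array of n chars. */
size_t trailing_zeros(const char *a, size_t n)
{
    size_t count = 0;
    while (n > 0 && a[n - 1] == '\0') {
        n--;
        count++;
    }
    return count;
}
```

Because unterminated holds no null character, it is not a string and must not be passed to functions such as strlen.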
When a string literal is not used to initialize an array, it is treated as an anonymous static object. For example, this:
puts("I am a fish");
…behaves as if an anonymous array is declared and initialized, and then is used in place of the literal:
static char foo[] = "I am a fish";
puts(foo);
The example relies on the name of an array decaying into a
pointer to
its first element, in line with pointer arithmetic. In an
expression in which an array name would not decay, such as
sizeof "I am a fish"
, the string
literal continues to be treated as an array, and the example
expression yields 12
, the size of the
equivalent array.
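The difference between the array and the pointer it decays to shows up in sizeof. A brief sketch, in which decayed_size and literal_length are hypothetical helpers:

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* size_t */
#include <string.h>  /* strlen */

/* As the operand of sizeof, the literal is still an array of 12 char. */
static_assert(sizeof "I am a fish" == 12, "11 characters plus the null");

/* Once assigned to a pointer, only the pointer's own size is visible. */
size_t decayed_size(void)
{
    const char *p = "I am a fish"; /* the array decays to a pointer here */
    return sizeof p;               /* size of the pointer, not the array */
}

/* strlen counts characters up to, but not including, the null. */
size_t literal_length(void)
{
    return strlen("I am a fish");
}
```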
Within a string literal, the same escape sequences as for
character constants can be used for special characters,
except that \'
is redundant, and
\"
is mandatory.
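To illustrate, the sketch below embeds both kinds of quote in string literals. The quoted function is a hypothetical helper used only for this example.

```c
#include <assert.h>  /* static_assert (C11) */
#include <string.h>  /* strcmp, strlen */

/* \' is accepted but redundant inside a string literal. */
static_assert(sizeof "it's" == sizeof "it\'s", "same array size either way");

/* A double quote inside a string literal must be escaped. */
const char *quoted(void)
{
    return "\"I am a fish\"";
}
```

Here quoted() returns a 13-character string that starts and ends with a double quote.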
- string-literal
encoding-prefix(opt) " s-char-sequence(opt) "
- encoding-prefix
u8
u
U
L
- s-char-sequence
s-char
s-char-sequence s-char
- s-char
any member of the source character set except the double quote ", backslash \, or new-line character
escape-sequence
The encoding for locale-dependent string literals and
character constants is implementation-defined. It is selected
by using the C locale in the category
LC_
, i.e., setlocale(LC_
. This is the
default, and must be restored before accessing any string
literals or character constants containing characters outside
of the basic set.