Translation

Names specified here
Name	Description	Notes				Source	Availability
`__DATE__`	Compilation date	L		M		Predefined	C89	C90	C95	C99	C11
`__TIME__`	Compilation time	L		M		Predefined	C89	C90	C95	C99	C11

A C program consists of one or more source files, with names conventionally ending in .c. Each source file undergoes a process of translation, often called compilation (although interpreters also exist), and the results of these processes are linked to produce the executable program. Each translation process is independent of every other, so nothing learned during one translation is retained to help another. Headers are used to ensure that information that must be shared across source files is consistent during their separate translation processes.

Source files and headers are composed of characters from the source character set, which includes all of the following:

0 1 2 3 4 5 6 7 8 9
a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

There is also a space character, and some control characters: vertical tab, horizontal tab, and form feed. These characters form the basic source character set, but there may be additional characters, depending on the locale that applies during translation. Finally, the source text is split into lines using an implementation-defined line terminator.

The output of a translation process is usually a file, often with a name related to the source file, and ending with .o or .obj, such that foo.c translates to foo.o, for example. However, this concept of a ‘file’ is internal to the implementation, and need not correspond to a real file.

There are several phases of translation:

In the first phase, the source file is interpreted as a series of characters in the source character set. Line terminators are translated into new-line characters. Each trigraph is interpreted as a single character:
- ??= for #
- ??( for [
- ??) for ]
- ??< for {
- ??> for }
- ??! for |
- ??' for ^
- ??- for ~
- ??/ for \
Apart from trigraph replacement, this phase is a no-op in most cases, as most implementations will want to keep things simple by using the native representation of characters.

From this phase onwards, each source character is considered a single unit, even if it was originally represented by multiple bytes.
In the second phase, each backslash character \ immediately followed by a new-line character is deleted, together with that new-line character. This allows long lines to be split for readability, without changing their meaning. This is especially useful in macro definitions, which have no other way of being split.
In the third phase, the source file is parsed as file-tokens. Each comment is replaced by a space sp.

file-tokens:
    file-token
    file-tokens file-token

file-token:
    preprocessing-token
    new-line
    sp (other white-space characters)

preprocessing-token:
    header-name (only as part of an #include directive)
    identifier
    pp-number
    character-constant
    string-literal
    punctuator
    any other character that does not match the other productions

new-line:
    the new-line character

pp-number:
    digit
    . digit
    pp-number digit
    pp-number identifier-nondigit
    pp-number e sign
    pp-number E sign
    pp-number p sign
    pp-number P sign
    pp-number .

punctuator:
    any of [ ] ( ) { } . -> ++ -- & * + - ~ ! / % << >> < > <= >= == != ^ | && || ? : ; ... = *= /= %= += -= <<= >>= &= ^= |= , # ## <: :> <% %> %: %:%:, matching longer sequences first

sign:
    +
    -

digit:
    any of 0 1 2 3 4 5 6 7 8 9

In the fourth phase, the file-tokens are parsed against preprocessing-file, and are thus scanned in sequence for preprocessing directives, macro invocations and _Pragma expressions. Directives and _Pragma expressions are executed, and macro invocations are expanded recursively. Preprocessing directives are then deleted.

When an #include directive is executed, its specified file undergoes the first four translation phases, and its resulting file-tokens replace the directive.
In the fifth phase, each character constant is converted into a single character in the execution character set, and each string literal is converted into a sequence of characters in the execution character set. Escape sequences such as \n are replaced by the characters they denote.

Implementation-defined behaviour occurs if some source characters cannot be converted to corresponding execution characters.
In the sixth phase, adjacent string literals are concatenated into one.
In the seventh phase, new-line and sp are discarded, leaving only preprocessing-tokens. Each preprocessing-token then becomes a token, and the new sequence of tokens is parsed as a translation-unit.

token

keyword

identifier

constant

string-literal

punctuator

The ordering of these phases is very important. Consider this program and its output:

#include <stdio.h>

char foo1[] = "aa\
nn";
char foo2[] = "aa\\
nn";

int main(void)
{
  printf("foo1: [%s]\n", foo1);
  printf("foo2: [%s]\n", foo2);
  return 0;
}

foo1: [aann]
foo2: [aa
n]

foo1 has a string literal split over two lines using a trailing backslash. foo2 appears to be illegally split, because its first line ends with a pair of backslashes, which one would expect to be translated to a single, literal backslash. However, this doesn't happen: the second of these backslashes and its following new-line character are deleted in phase 2, so the two lines are first joined to give:

char foo2[] = "aa\nn";

The remaining backslash and its following n are then interpreted in the later phase 5 as a single new-line character in the execution character set.

__DATE__ and __TIME__ are macros expanding to the date and time respectively of a moment during translation. __DATE__ has the format Mmm dd yyyy, where Mmm is the month name as generated by asctime, dd is the day of the month with a leading space if necessary, and yyyy is the year. __TIME__ has the format hh:mm:ss for hours, minutes and seconds.