For Gauche 0.9.7


9.3 gauche.cgen - Generating C code

A significant part of Gauche is written in Gauche itself or in an S-expression-based DSL. During the build process, these sources are converted into C sources and then compiled by the C compiler. The gauche.cgen module and its submodules expose the functionality used by the Gauche build process for general use.

Required features for a C code generator differ greatly among applications, and too much scaffolding could be a constraint for the module users. So, instead of providing a single solid framework, we provide a set of loosely coupled modules so that you can combine the necessary features freely. In fact, some parts of the Gauche build process use only gauche.cgen.unit and gauche.cgen.literal (see src/builtin-syms.scm, for example).

Module: gauche.cgen

This is a convenience module that extends gauche.cgen.unit, gauche.cgen.literal, gauche.cgen.type and gauche.cgen.cise together.

Usually you can just use gauche.cgen and don’t need to think about individual submodules. The following subsections are organized by submodules only for the convenience of explanation.


9.3.1 Generating C source files

One of the tricky issues in generating C source is that you have to put several fragments of code in different parts of the source file, even if you want to say just one thing. For example, sometimes you have to put a declaration before the actual definition, plus some setup code that needs to be run at initialization time.

Creating a frame

Class: <cgen-unit>

{gauche.cgen} A cgen-unit is a unit of C source generation. It corresponds to one .c file, and optionally one .h file. During the processing, a "current unit" is kept in the parameter cgen-current-unit, and most cgen APIs implicitly operate on it.

The following slots are for public use. They are used to tailor the output. Usually you set these slots at initialization time. The effect is undefined if you change them in the middle of the code generation process.

Instance Variable of <cgen-unit>: name

A string to name this unit. This is used for the default names of the generated files (name.c and name.h) and as the suffix of the default name of the initialization function. Other cgen modules may use this to generate names. Avoid using characters that are not valid in C identifiers.

You can override those default names by setting the other slots.

Instance Variable of <cgen-unit>: c-file
Instance Variable of <cgen-unit>: h-file

The names of the C source file and the header file, as strings. If they are #f (the default), the value of the name slot is used as the file name, with the extension .c or .h attached, respectively.

To get the names of the files to be generated, use the cgen-unit-c-file and cgen-unit-h-file generic functions instead of reading these slots.

Instance Variable of <cgen-unit>: preamble

A list of strings to be inserted at the top of the generated sources. The default value is ("/* Generated by gauche.cgen */"). Each string appears on its own line.

Instance Variable of <cgen-unit>: init-prologue
Instance Variable of <cgen-unit>: init-epilogue

A string to start or to end the initialization function, respectively. The default value of init-prologue is "void Scm_Init_NAME(void) {" where NAME is the value of the name slot. The default value of init-epilogue is just "}". Each string appears on its own line.

To get the default initialization function name, use the cgen-unit-init-name generic function.

To customize the name, arguments and/or return type of the initialization function, set init-prologue.

The content of the initialization function is filled by the code fragments registered by cgen-init.
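
As a minimal sketch of how these slots might be set at initialization time, the following creates a unit with a custom C file name, preamble and initialization function. It assumes that each slot has an init-keyword of the same name as the slot, as the name slot does; the unit name mylib, the file name and the function name Scm_Init_mylib_ext are hypothetical.

```scheme
(use gauche.cgen)

;; All names below are hypothetical. The init-keywords other than :name
;; are assumed to match the slot names.
(define *unit*
  (make <cgen-unit>
    :name "mylib"
    :c-file "mylib-gen.c"
    :preamble '("/* Generated file. Do not edit. */")
    ;; A custom prologue changes the name and the signature of the
    ;; initialization function; the epilogue just closes it.
    :init-prologue "void Scm_Init_mylib_ext(ScmModule *mod) {"
    :init-epilogue "}"))
```

All slots are set when the unit is created, in line with the advice above not to change them in the middle of code generation.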

Parameter: cgen-current-unit

A parameter to keep the current cgen-unit.

A typical flow of generating C code is as follows:

  1. Create a <cgen-unit> and make it the current unit.
  2. Call code insertion APIs with code fragments. Fragments are accumulated in the current unit.
  3. Call the emit procedures (cgen-emit-c and cgen-emit-h) on the unit, which generate a C file and optionally a header file.

Generic Function: cgen-emit-c cgen-unit
Generic Function: cgen-emit-h cgen-unit

{gauche.cgen} Write the accumulated code fragments in cgen-unit to a C source file and a C header file, respectively. The names of the files are determined by calling cgen-unit-c-file and cgen-unit-h-file. If the files already exist, their contents are overwritten; you can’t write to the files incrementally. So these procedures are usually called as the last step of code generation.

We’ll explain the details of how each file is organized under “Filling the content” section below.

Generic Function: cgen-unit-c-file cgen-unit
Generic Function: cgen-unit-h-file cgen-unit

{gauche.cgen} Returns a string that names the C source file or the header file for cgen-unit, respectively. The default method first looks at the c-file or h-file slot of cgen-unit; if it is #f, the method takes the value of the name slot and appends the extension .c or .h.

Generic Function: cgen-unit-init-name cgen-unit

{gauche.cgen} Returns a string that names the initialization function in the generated C code. It is used to create the default init-prologue value.
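
The following sketch queries the names through these generic functions, which is the recommended way to learn what will be generated. The unit name my-cfile and the file name generated.c are hypothetical, and the :c-file init-keyword is assumed to exist under the slot's name. The results are printed rather than listed here.

```scheme
(use gauche.cgen)

;; A unit that relies on the defaults.
(define default-unit (make <cgen-unit> :name "my-cfile"))

;; A unit that overrides the C file name (assumes the :c-file init-keyword).
(define custom-unit
  (make <cgen-unit> :name "my-cfile" :c-file "generated.c"))

;; Ask the generic functions instead of reading the slots.
(print (cgen-unit-c-file default-unit))    ; derived from the name slot
(print (cgen-unit-c-file custom-unit))     ; taken from the c-file slot
(print (cgen-unit-h-file default-unit))
(print (cgen-unit-init-name default-unit))
```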

Filling the content

There are four parts to which you can add C code fragments. Within each part, code fragments are rendered in the same order as they were added.

extern
    This part is put into the header file, if one is generated.

decl
    Placed at the beginning of the C source, after the standard prologue.

body
    Placed in the C source, following the ’decl’ part.

init
    Placed inside the initialization function, which appears at the end of the C source.

The following procedures are the simple way to put source code fragments in the appropriate part:

Function: cgen-extern code …
Function: cgen-decl code …
Function: cgen-body code …
Function: cgen-init code …

{gauche.cgen} Put code fragments code … to the appropriate parts. Each fragment must be a string.

This is a minimal example to show the typical usage. After running this code you’ll get my-cfile.c and my-cfile.h in the current directory.

(use gauche.parameter)
(use gauche.cgen)

(define *unit* (make <cgen-unit> :name "my-cfile"))

(parameterize ([cgen-current-unit *unit*])
  (cgen-decl "#include <stdio.h>")
  (cgen-init "fprintf(stderr, \"initialization function\\n\");")
  (cgen-body "void foo(int n) { fprintf(stderr, \"got %d\\n\", n); }")
  (cgen-extern "void foo(int n);"))

(cgen-emit-c *unit*)
(cgen-emit-h *unit*)

These are handy escaping procedures; they are useful even if you don’t use other parts of the cgen modules.

Function: cgen-safe-name string
Function: cgen-safe-name-friendly string
Function: cgen-safe-string string
Function: cgen-safe-comment string

{gauche.cgen} Escapes characters invalid in C identifiers, C string literals or C comments.

With cgen-safe-name, characters other than ASCII letters and digits are converted to the form _XX, where XX is the hexadecimal notation of the character code. (Note that the character _ is also converted.) So the returned string can be used safely as a C identifier. The mapping is injective; that is, if the source strings differ, the result strings always differ.

On the other hand, cgen-safe-name-friendly converts the input string into a more readable C identifier. -> becomes _TO (e.g. char->integer becomes char_TOinteger), other - and _ become _, ? becomes P (e.g. char? becomes charP), ! becomes X (e.g. set! becomes setX), < and > become _LT and _GT respectively. Other special characters except _ are converted to _XX as in cgen-safe-name. The mapping is not injective; e.g. both read-line and read_line map to read_line. Use this only when you think some human needs to read the generated C code (which is not recommended, by the way).

If you want to write out a Scheme string as a C string literal, you can use cgen-safe-string. It escapes control characters and non-ASCII characters. If the Scheme string contains a character beyond ASCII, it is encoded in Gauche’s native encoding. (NB: It also escapes ?, to avoid accidental formation of C trigraphs.)

Much simpler is cgen-safe-comment, which just converts /* and */ into / * and * / (with a space between the two characters), so that it won’t terminate the comment inadvertently. (Technically, escaping only */ suffices, but some simple-minded C parsers might be confused by /* in comments.) This conversion isn’t injective either.

(cgen-safe-name "char-alphabetic?")
  ⇒ "char_2dalphabetic_3f"
(cgen-safe-name-friendly "char-alphabetic?")
  ⇒ "char_alphabeticP"
(cgen-safe-string "char-alphabetic?")
  ⇒ "\"char-alphabetic\\077\""

(cgen-safe-comment "*/*")
  ⇒ "* / *"

If you want to conditionalize a fragment by C preprocessor #ifdefs, use the following macro:

Macro: cgen-with-cpp-condition cpp-expr body …

{gauche.cgen} Code fragments submitted in body … are protected by #if cpp-expr and #endif.

If cpp-expr is a string, it is emitted literally:

(cgen-with-cpp-condition "defined(FOO)"
  (cgen-init "foo();"))

;; will generate:
#if defined(FOO)
foo();
#endif /* defined(FOO) */

You can also construct cpp-expr as an S-expression.

<cpp-expr> : <string>
           | (defined <cpp-expr>)
           | (not <cpp-expr>)
           | (<n-ary-op> <cpp-expr> <cpp-expr> ...)
           | (<binary-op> <cpp-expr> <cpp-expr>)

<n-ary-op> : and | or | + | * | - | /

<binary-op> : > | >= | == | < | <= | !=
            | logand | logior | lognot | >> | <<


(cgen-with-cpp-condition '(and (defined FOO)
                               (defined BAR))
  (cgen-init "foo();"))

;; will generate:
#if ((defined FOO)&&(defined BAR))
foo();
#endif /* ((defined FOO)&&(defined BAR)) */

You can nest cgen-with-cpp-condition.
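
A minimal sketch of nesting, assuming the inner fragment ends up guarded by both conditions. The macro names HAVE_FOO and FOO_VERSION, the header foo.h and the function foo_init2 are hypothetical. The inner condition is given as a plain string so that it is emitted literally.

```scheme
(use gauche.parameter)
(use gauche.cgen)

(define *unit* (make <cgen-unit> :name "nested-cond"))

(parameterize ([cgen-current-unit *unit*])
  ;; Outer condition, written as an S-expression.
  (cgen-with-cpp-condition '(defined HAVE_FOO)
    (cgen-decl "#include <foo.h>")
    ;; Inner condition, written as a string that is emitted as is.
    (cgen-with-cpp-condition "FOO_VERSION >= 2"
      (cgen-init "foo_init2();"))))

(cgen-emit-c *unit*)
```

Inspecting the generated nested-cond.c shows how the two conditions are arranged around each fragment.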

Submitting code fragments for more than one part

When you try to abstract the code generation process, calling individual procedures for each part (e.g. cgen-body or cgen-init) becomes tedious, since higher-level constructs are likely to require generating code fragments in several parts. Instead, you can create a customized class that handles the submission of fragments to the appropriate parts.

Class: <cgen-node>

{gauche.cgen} A base class to represent a set of code fragments.

The state of the C preprocessor condition (set by cgen-with-cpp-condition) is captured when an instance of a subclass of this class is created, so the generation of the appropriate #ifs and #endifs is handled automatically.

You subclass <cgen-node>, then define method(s) to one or more of the following generic functions:

Generic Function: cgen-emit-xtrn cgen-node
Generic Function: cgen-emit-decl cgen-node
Generic Function: cgen-emit-body cgen-node
Generic Function: cgen-emit-init cgen-node

{gauche.cgen} These generic functions are called while writing out the C source and header within cgen-emit-c and cgen-emit-h. Inside these methods, anything written to the current output port goes into the output file.

While the .h file is generated by cgen-emit-h, the cgen-emit-xtrn method is called for all submitted nodes, in order of submission.

While the .c file is generated by cgen-emit-c, the cgen-emit-decl method is called for all submitted nodes first, then the cgen-emit-body method, then the cgen-emit-init method.

If you don’t specialize one of these methods, the node doesn’t generate code in that part.

Once you define your subclass and create an instance, you can submit it to the current cgen unit by this procedure:

Function: cgen-add! cgen-node

{gauche.cgen} Submit cgen-node to the current cgen unit. If the current unit is not set, cgen-node is simply ignored.

In fact, the procedures cgen-extern, cgen-decl, cgen-body and cgen-init are just convenience wrappers that create an instance of an internal subclass specialized to generate a code fragment only in the designated part.
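
The subclassing approach can be sketched as follows. The class <my-counter>, its c-name slot and the variable name my_counter are hypothetical; only <cgen-node>, the cgen-emit-* generic functions and cgen-add! come from the description above. Each method simply prints its fragment to the current output port.

```scheme
(use gauche.parameter)
(use gauche.cgen)

;; Hypothetical node: a global C int variable that needs an extern
;; declaration, a definition, and a line of initialization code.
(define-class <my-counter> (<cgen-node>)
  ((c-name :init-keyword :c-name)))

;; Goes into the header file.
(define-method cgen-emit-xtrn ((node <my-counter>))
  (print "extern int " (~ node 'c-name) ";"))

;; Goes into the body part of the C file.
(define-method cgen-emit-body ((node <my-counter>))
  (print "int " (~ node 'c-name) " = 0;"))

;; Goes inside the initialization function.
(define-method cgen-emit-init ((node <my-counter>))
  (print "  " (~ node 'c-name) " = 1;"))

(define *unit* (make <cgen-unit> :name "counter"))

(parameterize ([cgen-current-unit *unit*])
  ;; One submission covers all three parts.
  (cgen-add! (make <my-counter> :c-name "my_counter")))

(cgen-emit-c *unit*)
(cgen-emit-h *unit*)
```

No cgen-emit-decl method is defined for <my-counter>, so the node contributes nothing to the decl part.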


9.3.2 Generating Scheme literals

Sometimes you want to refer to a Scheme constant value in C code. It is trivial if the value is a simple thing like Scheme boolean (SCM_TRUE, SCM_FALSE), characters (SCM_MAKE_CHAR(code)), small integers (SCM_MAKE_INT(value)), etc. You can directly write it in C code. However, once you step outside of these simple values, it gets tedious quickly, involving static data declarations and/or runtime initialization code.

For example, to get a Scheme value of a list of symbols (a b c), you have to (1) create ScmStrings for the names of the symbols, (2) pass them to Scm_Intern to get Scheme symbols, then (3) call Scm_Conses (or a convenience macro SCM_LIST3) to build a list.

With gauche.cgen, that code can be generated automatically.

NOTE: If you use cgen-literal, make sure you call (cgen-decl "#include <gauche.h>") to include gauche.h before the first call of cgen-literal, which may insert declarations that need gauche.h.

Function: cgen-literal obj

{gauche.cgen} Returns a <cgen-literal> object for a Scheme object obj, and submits the necessary declarations and initialization code to the current cgen unit.

For the above example, you can just call (cgen-literal '(a b c)) and the C code to set up the Scheme literal of the list of three symbols will be generated.

The result of cgen-literal is an instance of <cgen-literal>; the details of the class aren’t for public use, but you can use the instance to refer to the created literal in C code.

Generic Function: cgen-cexpr cgen-literal

{gauche.cgen} Returns a C code expression fragment of type ScmObj, which represents the Scheme literal value.

The following example creates a C function printabc that prints the literal value (a b c), created by cgen-literal.

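(use gauche.parameter)
(use gauche.cgen)
(define *unit* (make <cgen-unit> :name "foo"))
(parameterize ((cgen-current-unit *unit*))
  ;; gauche.h must be included before the first cgen-literal call
  (cgen-decl "#include <gauche.h>")
  (let1 lit (cgen-literal '(a b c))
    ;; submit the function definition to the body part
    (cgen-body
     (format "void printabc() { Scm_Printf(SCM_CUROUT, \"%S\", ~a); }"
             (cgen-cexpr lit)))))
(cgen-emit-c *unit*)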

If you examine the generated file foo.c, you’ll get a general idea of how it is handled.

One advantage of cgen-literal is that it tries to share the same literal whenever possible. If you call (cgen-literal '(a b c)) twice in the same cgen unit, you’ll get one instance of cgen-literal. If you then call (cgen-literal '(b c)), it will share the tail of the original list (a b c). So you can just use cgen-literal whenever you need Scheme literal values, without worrying about generating an excessive amount of duplicated code.

Certain Scheme objects cannot be generated as a literal; for example, an opened port can’t, since it carries lots of runtime information.

(There’s a machinery to allow programmers to extend the cgen-literal behavior for new types. The API isn’t fixed yet, though.)


9.3.3 Conversions between Scheme and C

In the C world, any Scheme object is uniformly of type ScmObj. But it is often the case that you need to narrow down to the specific type and convert it to a C value. Gauche maintains a database of how to typecheck and map Scheme value to C value and vice versa.

Note that the mapping isn’t one-to-one: Scheme <integer> can be mapped to C’s short, long, unsigned int, or even just ScmObj if the C routine wants to cover bignums. So each mapping has its own name. For historical reasons, each mapping is called a stub type. The names of stub types look like Scheme types, but their semantics differ from those of Scheme types. Remember: Each stub type represents a specific mapping between a Scheme type and a C type.

Each stub type has a C-predicate, a boxer and an unboxer, each of which is a Scheme string naming a C function or C macro. A C-predicate takes a ScmObj object and returns a C boolean value that tells whether the given object has a valid type and range for the stub type. A boxer takes a C object and converts it to a Scheme object; it usually involves wrapping or boxing the C value in a tagged pointer or object, hence the name. An unboxer does the opposite: it takes a Scheme object and converts it to a C value. The Scheme object must be checked by the C-predicate before being passed to the unboxer.

The following table shows the predefined stub types. Note that most of the aggregate types have one-to-one mappings. The difficult ones are numeric types and strings. Scheme numbers can represent a much wider range of numbers than C, so you have to narrow them down according to the capability of the C routine. Scheme strings have a byte size and a character length, and the body may not be NUL-terminated; so the <string> stub type maps a Scheme string to ScmString*. For convenience, you can use <const-cstring>, which creates a NUL-terminated C string; beware that it may incur some copying cost.

Stub type    Scheme       C           Notes
<fixnum>     <integer>    int         Integers within fixnum range
<integer>    <integer>    ScmObj      Any exact integers
<real>       <real>       double      Value converted to double
<number>     <number>     ScmObj      Any numbers

<int>        <integer>    int         Integers representable in C
<int8>       <integer>    int
<int16>      <integer>    int
<int32>      <integer>    int
<short>      <integer>    short
<long>       <integer>    long
<uint>       <integer>    uint        Integers representable in C
<uint8>      <integer>    uint
<uint16>     <integer>    uint
<uint32>     <integer>    uint
<ushort>     <integer>    ushort
<ulong>      <integer>    ulong
<float>      <real>       float       Unboxed value casted to float
<double>     <real>       double      Alias of <real>

<boolean>    <boolean>    int         Boolean value
<char>       <char>       ScmChar     Note: not a C char

<void>       -            void        (Used only as a return type.
                                        Scheme function returns #<undef>)

<string>     <string>     ScmString*  Note: not a C string

<const-cstring>
             <string>     const char* For arguments, string is unboxed
                                      by Scm_GetStringConst.
                                      For return values, C string is boxed
                                      by SCM_MAKE_STR_COPYING.

<const-cstring-safe>
             <string>     const char* Like <const-cstring>,
                                      but when converting from Scheme,
                                      reject a string with NUL chars in it.

<pair>       <pair>       ScmPair*
<list>       <list>       ScmObj
<string>     <string>     ScmString*
<symbol>     <symbol>     ScmSymbol*
<keyword>    <keyword>    ScmKeyword*
<vector>     <vector>     ScmVector*
<uvector>    <uvector>    ScmUVector*
<s8vector>   <s8vector>   ScmS8Vector*
<u8vector>   <u8vector>   ScmU8Vector*
<s16vector>  <s16vector>  ScmS16Vector*
<u16vector>  <u16vector>  ScmU16Vector*
<s32vector>  <s32vector>  ScmS32Vector*
<u32vector>  <u32vector>  ScmU32Vector*
<s64vector>  <s64vector>  ScmS64Vector*
<u64vector>  <u64vector>  ScmU64Vector*
<f16vector>  <f16vector>  ScmF16Vector*
<f32vector>  <f32vector>  ScmF32Vector*
<f64vector>  <f64vector>  ScmF64Vector*

<hash-table> <hash-table> ScmHashTable*
<tree-map>   <tree-map>   ScmTreeMap*

<char-set>   <char-set>   ScmCharSet*
<regexp>     <regexp>     ScmRegexp*
<regmatch>   <regmatch>   ScmRegMatch*
<port>       <port>       ScmPort*
<input-port>  <input-port> ScmPort*
<output-port> <output-port> ScmPort*
<procedure>  <procedure>  ScmProcedure*
<closure>    <closure>    ScmClosure*
<promise>    <promise>    ScmPromise*

<class>      <class>      ScmClass*
<method>     <method>     ScmMethod*
<module>     <module>     ScmModule*
<thread>     <thread>     ScmVM*
<mutex>      <mutex>      ScmMutex*
<condition-variable> <condition-variable> ScmConditionVariable*

A stub type can have a maybe variation, denoted by a ? suffix; e.g. <string>?. It is a union type of the base type and boolean false (for <string>?, it can be either <string> or #f). In the C world, boolean false is mapped to the NULL pointer. This is convenient for passing back and forth a C value that is allowed to be NULL: if you pass #f from the Scheme world, it comes out as NULL in the C world, and vice versa. The maybe variation is only meaningful when the C type is a pointer type.

Class: <cgen-type>

{gauche.cgen} An instance of this class represents a stub type. It can be looked up by name such as <const-cstring> by cgen-type-from-name.

Function: cgen-type-from-name name

{gauche.cgen} Returns an instance of <cgen-type> that has name. If the name is unknown, #f is returned.

Function: cgen-box-expr cgen-type c-expr
Function: cgen-unbox-expr cgen-type c-expr
Function: cgen-pred-expr cgen-type c-expr

{gauche.cgen} c-expr is a string that denotes a C expression. Returns a string containing a C expression that boxes, unboxes, or typechecks c-expr according to cgen-type.

;; suppose foo() returns char*
(cgen-box-expr
 (cgen-type-from-name '<const-cstring>)
 "foo()")
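
As a further sketch, the three procedures can be applied to the same stub type to see which C expressions they produce. The C variable names obj and n and the helper stub-type-or-error are hypothetical. The generated strings are printed rather than listed here, since their exact form depends on the stub type definition.

```scheme
(use gauche.cgen)

;; Look up a predefined stub type by name.
(define fixnum-type (cgen-type-from-name '<fixnum>))

;; "obj" is assumed to be a C variable of type ScmObj, "n" a C int.
(print (cgen-pred-expr  fixnum-type "obj"))  ; C expression that typechecks obj
(print (cgen-unbox-expr fixnum-type "obj"))  ; C expression yielding the C value
(print (cgen-box-expr   fixnum-type "n"))    ; C expression yielding a ScmObj

;; cgen-type-from-name returns #f for an unknown name,
;; so a small wrapper can turn that into an error.
(define (stub-type-or-error name)
  (or (cgen-type-from-name name)
      (error "unknown stub type:" name)))
```

In generated code, the predicate expression would normally guard the unboxing expression, because the unboxer expects an object that has already been checked.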


9.3.4 CiSE - C in S expression

Some low-level routines in Gauche are implemented in C, but they’re written as S-expressions. We call it “C in S expression”, or CiSE.

The advantage of using S-expression is its readability, obviously. Another advantage is that it allows us to write macros as S-expr to S-expr translation, just like the legacy Scheme macros. That’s a powerful feature—effectively you can extend C language to suit your needs.

The gauche.cgen.cise module provides a set of tools to convert CiSE code into C code to be passed to the C compiler. It also has some support to overcome C quirks, such as preparing forward declarations.

Currently, we don’t check CiSE rigorously; you can pass CiSE that yields invalid C code, which will cause the C compiler to emit errors. The translator inserts line directives by default, so the C compiler error messages point to the location in the original (CiSE) source instead of the generated code; however, sometimes you need to look at the generated code to figure out what went wrong. We hope this will be improved in the future.

In the Gauche source code, CiSE is used extensively in precompiled Scheme files and is recognized by the precompiler (precomp). However, gauche.cgen.cise is an independent module that relies only on the basic features of gauche.cgen, so you can plug it into your own C code generating programs.

CiSE overview

Before diving into the details, it’s easier to grasp some basic concepts.

A CiSE fragment is an S-expression that follows CiSE syntax (see CiSE syntax). A CiSE fragment can be translated to a C code fragment by cise-render. Note that some translations may not be local, meaning a fragment may want to emit forward declarations before other C code fragments. So the full translation requires buffering: you process all the CiSE fragments and save the output, emit the forward declarations, then emit the saved C code fragments. We have a wrapper procedure, cise-translate, to take care of this, but for your purposes you may want to roll your own wrapper.

A CiSE macro is Scheme code that translates a CiSE fragment into another CiSE fragment. There are a number of predefined CiSE macros. You can add your own CiSE macros with utilities such as define-cise-stmt and define-cise-expr.

A CiSE ambient is a bundle of information that affects fragment translation. It contains CiSE macro definitions, and also it keeps track of forward declarations.
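
As a rough sketch of rendering a single fragment, the following passes a small CiSE statement to cise-render-to-string and prints the resulting C code. The fragment (return (+ x 1)) is a hypothetical example, and the call assumes that the default ambient and the default context are suitable for a statement. The output is printed rather than listed here, since it may include line directives.

```scheme
(use gauche.cgen)

;; Render one CiSE statement into a string of C code and print it.
(define c-code (cise-render-to-string '(return (+ x 1))))
(print c-code)
```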

CiSE syntax

CiSE procedures

Parameter: cise-ambient


Function: cise-default-ambient


Function: cise-ambient-copy ambient


Function: cise-ambient-decl-strings ambient


Parameter: cise-emit-source-line


Function: cise-render cise-fragment :optional port context


Function: cise-render-to-string cise-fragment :optional context


Function: cise-render-rec cise-fragment stmt/expr env


Function: cise-translate inp outp :key environment


Function: cise-register-macro! name expander :optional ambient


Function: cise-lookup-macro name :optional ambient


Macro: define-cise-stmt name [env] clause … [:where definition …]
Macro: define-cise-expr name [env] clause … [:where definition …]
Macro: define-cise-toplevel name [env] clause … [:where definition …]


Macro: define-cise-macro (name form env) body …
Macro: define-cise-macro name name2

