\name{struct}
\alias{struct}
\alias{pad}
\alias{as.struct}
\alias{is.struct}
\alias{as.list.Ctype}
\title{
Construct a Ctype struct
}
\description{
Construct arbitrarily complex \sQuote{struct}ures
in R for use with on-disk C structs.
}
\usage{
struct(..., bytes, offset)

is.struct(x)
}
\arguments{
  \item{\dots}{
Field types contained in struct.
}
  \item{bytes}{
The total number of bytes in the struct. See details.
}
  \item{offset}{
The byte offset of members of the struct. See details.
}
  \item{x}{
Object to test.
}
}
\details{
\code{struct} provides a high-level, \R-based
description of a C struct
data type stored on disk.

The types of data that can be contained within
a structure (byte array) on disk can be any
combination of the following: int8, uint8, int16,
uint16, int32, real32, and real64.  \sQuote{struct}s
are not recursive; that is, all structs contained
within a struct must be logically flattened (core
elements extracted).

All C types are converted to the appropriate \R
type internally.

It is best to consider a struct a simple byte array
in which a valid C variable type exists at each
specified offset.  Describing the struct using the \R
function \code{struct} allows mmap extraction
to proceed as if the entire structure were one block
(a single \sQuote{i} value), so each block
of bytes can be read into \R with one
operation.
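
As a minimal sketch of this layout (base \R only; the member sizes and
variable names below are chosen for illustration and are not part of the
mmap API), the packed offsets and the record size follow from the member sizes:
\preformatted{
sizes  <- c(uint8 = 1, int16 = 2, int32 = 4, real64 = 8)  # bytes per member
offset <- cumsum(sizes) - sizes   # 0 1 3 7: each member starts where the previous ends
bytes  <- sum(sizes)              # 15 bytes per packed record
record <- 3                       # 1-based record index
(record - 1) * bytes + offset     # 30 31 33 37: file offsets of the members of record 3
}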

One important distinction between the \R struct (and the examples that follow)
and a C struct relates to byte alignment.  The \R version
effectively serializes the data, without padding to word boundaries.  See the
following section on ANSI C for details on reading data generated by an external process
such as a C/C++ program.
}
\value{
A list of values, one element for each
type of \R data.
}
\section{ANSI_C}{
ANSI C structs will typically have padding where required
by the language details and/or C programs.  In general, if the struct
on disk has padding, the use of \code{bytes} and \code{offset} is required
to maintain alignment with the extraction and replacement code in mmap for \R.

A simple example is a struct holding an 8-byte double (real64) and
a 4-byte integer (int32).  Created by a C/C++ program, the result will be
a 16-byte struct, where the final 4 bytes are padding.

To accommodate this from mmap, it is required to specify the corrected
\code{bytes} (e.g. \code{bytes=16} in this example).  For cases where padding
is not at the end of the struct (e.g. if an additional 8-byte double was
added as the final member of the previous struct), it would also
be necessary to correct the offset to reflect the internal padding. Here,
the correct setting would be \code{offset=c(0,8,16)}, since the 4-byte
integer will be padded to 8 bytes to allow the final double to
begin on a word boundary (on a 64-bit platform).
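
The arithmetic behind these values can be sketched in base \R.  This is an
illustration only: it assumes each member is aligned to its own size and that
the struct is padded to a multiple of its largest alignment, which is typical
but compiler and platform dependent.  The variable names are hypothetical and
not part of the mmap API.
\preformatted{
size  <- c(8, 4, 8)              # double, int, double
align <- c(8, 4, 8)              # assumed: natural alignment equals member size
offset <- numeric(length(size))
pos <- 0
for (k in seq_along(size)) {
  pos <- ceiling(pos / align[k]) * align[k]   # round up to the member's alignment
  offset[k] <- pos
  pos <- pos + size[k]
}
bytes <- ceiling(pos / max(align)) * max(align)  # add any trailing padding
offset   # 0 8 16
bytes    # 24
}
Under the same assumptions, dropping the final double gives 12 bytes of data
rounded up to \code{bytes=16}, matching the two-member case described earlier
in this section.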

This is a general mechanism to adjust for offset - but requires knowledge
of both the struct on disk as well as the generating process. At some point
in the near future \code{struct} will attempt to properly adjust for
offset if mmap is used on data created from outside of R.

It is important to note that this alignment is also dependent on the underlying
hardware word size (size_t) and is more complicated than the above example.
}
\references{
\url{https://en.wikipedia.org/wiki/Struct_(C_programming_language)}
\url{https://en.wikipedia.org/wiki/Data_structure_alignment}
}
\author{
Jeffrey A. Ryan
}
\note{
\sQuote{struct}s can be thought of as \sQuote{rows}
in a database.  If many different types always need to
be returned together, it will be more efficient to
store them together in a struct on disk. This reduces
the number of page hits required to fetch all required
data.  Conversely, if individual columns are desired,
it will likely make sense to simply store vectors
in separate files on disk and read them in with \code{mmap}
individually as needed.

Note that not all behavior of struct extraction and replacement
is defined for all virtual and real types yet. This is
an ongoing development and will be completed in the near future.
}

\seealso{
\code{\link{types}}
}
\examples{
tmp <- tempfile()

f <- file(tmp, open="ab")
u_int_8 <- c(1L, 255L, 22L)  # 1 byte, valid range 0:255
int_8 <- c(1L, -127L, -22L)  # 1 byte, valid range -128:127
u_int_16 <- c(1L, 65000L, 1000L) # 2 byte, valid range 0:65535
int_16 <- c(1L, 25000L, -1000L) # 2 byte, valid range -32768:32767
int_32 <- c(98743L, -9083299L, 0L) # 4 byte, standard R integer
float_32 <- c(9832.22, 3.14159, 0.00001)
cplx_64 <- c(1+0i, 0+8i, 2+2i)

# not yet supported in struct
char_ <- writeBin(as.raw(1:3), raw())
fixed_width_string <- c("ab","cd","ef")

for(i in 1:3) {
  writeBin(u_int_8[i],  f, size=1L)
  writeBin(int_8[i],    f, size=1L)
  writeBin(u_int_16[i], f, size=2L)
  writeBin(int_16[i],   f, size=2L)
  writeBin(int_32[i],   f, size=4L)
  writeBin(float_32[i], f, size=4L)  # store as 32bit - prec issues
  writeBin(float_32[i], f, size=8L)  # store as 64bit
  writeBin(cplx_64[i],  f)
  writeBin(char_[i], f)
  writeBin(fixed_width_string[i], f)
}
close(f)

m <- mmap(tmp, struct(uint8(),
                      int8(),
                      uint16(),
                      int16(), 
                      int32(),
                      real32(),
                      real64(),
                      cplx(),
                      char(),  # also raw()
                      char(2)  # character array of n characters each
                     ))  
length(m) # only 3 'struct' elements
str(m[])

m[1:2]

# add a post-processing function to convert some elements (rows) to a data.frame
extractFUN(m) <- function(x,i,...) {
                   x <- x[i]
                   # names follow the member order of the struct above
                   data.frame(u_int_8=x[[1]],
                                int_8=x[[2]],
                             u_int_16=x[[3]],
                               int_16=x[[4]],
                               int_32=x[[5]],
                             float_32=x[[6]],
                              real_64=x[[7]]
                             )
                 }
m[1:2]
munmap(m)

# grouping commonly fetched data by row reduces
# disk IO, as values reside together on a page
# in memory (which is paged in by mmap). Here
# we try 3 columns, or one row of 3 values.
# note that with structs we replicate a row-based
# structure.
#
# 13 byte struct
x <- c(writeBin(1L, raw(), size=1),
       writeBin(3.14, raw(), size=4),
       writeBin(100.1, raw(), size=8))
writeBin(rep(x,1e6), tmp)
length(x)
m <- mmap(tmp, struct(int8(),real32(),real64()))
length(m)
m[1]
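
# A sketch (base R only, not part of the mmap API) of what the
# member offsets mean: decode one packed 13-byte record by hand.
# 'rec' and 'off' are illustrative names introduced here.
rec <- c(writeBin(1L, raw(), size=1),
         writeBin(3.14, raw(), size=4),
         writeBin(100.1, raw(), size=8))
off <- cumsum(c(1, 4, 8)) - c(1, 4, 8)       # packed offsets: 0 1 5
readBin(rec[off[1] + 1],   "integer", size=1) # int8 at byte offset 0
readBin(rec[off[2] + 1:4], "numeric", size=4) # real32 at byte offset 1
readBin(rec[off[3] + 1:8], "numeric", size=8) # real64 at byte offset 5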

# create the columns in separate files (like a column
# store)
t1 <- tempfile()
t2 <- tempfile()
t3 <- tempfile()
writeBin(rep(x[1],1e6), t1)
writeBin(rep(x[2:5],1e6), t2)
writeBin(rep(x[6:13],1e6), t3)

m1 <- mmap(t1, int8())
m2 <- mmap(t2, real32())
m3 <- mmap(t3, real64())
list(m1[1],m2[1],m3[1])

i <- 5e5:6e5

# note that times are ~3x faster for the struct
# due to decreased disk IO and CPU cost to process
# loop over k so the index range i defined above is not overwritten
system.time(for(k in 1:100) m[i])
system.time(for(k in 1:100) m[i])
system.time(for(k in 1:100) list(m1[i],m2[i],m3[i]))
system.time(for(k in 1:100) list(m1[i],m2[i],m3[i]))
system.time(for(k in 1:100) {m1[i];m2[i];m3[i]}) # no cost to list()

# you can skip struct members by specifying offset and bytes
m <- mmap(tmp, struct(int8(),
                     #real32(),   here we are skipping the 4 byte float
                      real64(),
                      offset=c(0,5), bytes=13))
# alternatively you can add padding directly
n <- mmap(tmp, struct(int8(), pad(4), real64()))

pad(4)
pad(int32())

m[1]
n[1]

munmap(m)
munmap(n)
munmap(m1)
munmap(m2)
munmap(m3)
unlink(t1)
unlink(t2)
unlink(t3)
unlink(tmp)

}
\keyword{ programming }
\keyword{ IO }
\keyword{ iteration }
