unf

Computes a Universal Numeric Fingerprint (UNF) for a single vector in an
ASCII text file given as standard input.

_Usage_

program unf, plugin using("../stata/unf.plugin")
plugin call unf varlist, digits(num)


Note: The command takes a single vector only. The type of the vector must
be specified correctly, or the resulting UNF will be wrong. Missing values
must be encoded as '.' for numeric types and as a blank line for character
types.

This is a standalone interface to the UNF library.
For a more flexible interface for creating UNFs, see the R package UNF.


_Details_

     A universal numeric fingerprint is used to guarantee that a
     defined subset of data is substantively identical to a comparison
     subset. Two fingerprints will match if and only if the subsets of
     data generating them are identical when represented using a given
     number of significant digits.


     A UNF is created by rounding data values (or truncating strings)
     to a known number of digits (characters), representing those
     values in standard form (as 32-bit Unicode-formatted strings), and
     applying a fingerprinting method (such as a cryptographic hash
     function) to this representation. UNFs are computed from data
     values provided by the statistical package, so they directly
     reflect the internal representation of the data, that is, the data
     as the statistical package interprets it.
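
     The round, normalize, and hash steps described above can be
     sketched in Python. This is a toy illustration only, not the UNF
     algorithm: the function name, the "missing" marker, the exact
     string layout, and the choice of SHA-256 are all assumptions made
     for the example, so its output will not match a real UNF.

```python
import base64
import hashlib


def toy_fingerprint(values, digits=6):
    """Toy value-based fingerprint; NOT the real UNF algorithm."""
    parts = []
    for v in values:
        if v is None:
            # Hypothetical marker for a missing value.
            parts.append("missing")
        else:
            # Exponential notation with `digits` significant digits.
            parts.append(f"{v:.{digits - 1}e}")
    canonical = "\n".join(parts) + "\n"
    # Encode as 32-bit Unicode, then hash (SHA-256 is an assumption).
    digest = hashlib.sha256(canonical.encode("utf-32-be")).digest()
    return base64.b64encode(digest).decode("ascii")


a = toy_fingerprint([1.0000001, 2.5, None])
b = toy_fingerprint([1.0000002, 2.5, None])  # differs past 6 digits
c = toy_fingerprint([1.001, 2.5, None])      # differs within 6 digits
print(a == b, a == c)
```

     Values that agree to the requested number of significant digits
     produce the same canonical string and therefore the same
     fingerprint, while a change within those digits alters it.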

     A UNF differs from an ordinary file checksum in several important
     ways:

     1. _UNFs are format independent._  The UNF for the data will be
     the same regardless of whether the data is saved in R binary
     format, as a SAS file, as a Stata file, etc., but file checksums
     will differ.

     2. _UNFs are robust to insignificant rounding error._  A UNF will
     also be the same if the data differs only in non-significant
     digits; a file checksum will not.

     3. _UNFs detect misinterpretation of the data by the statistical
     software._  If the statistical software misreads the file, the
     resulting UNF will not match the original, but the file checksums
     may match.

     4. _UNFs are strongly tamper resistant._  Any accidental or
     intentional changes to the data values will change the resulting
     UNF. Most file checksums and descriptive statistics detect only
     certain types of changes.
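
     The contrast with a file checksum (point 1 above) can be
     illustrated with a small Python sketch. It is not the UNF
     algorithm: `value_fingerprint` is a hypothetical stand-in that
     hashes rounded values, and CSV-like and JSON text stand in for
     the statistical file formats named above.

```python
import hashlib
import json


def value_fingerprint(values, digits=6):
    """Hash the rounded values, not the bytes of the file (toy stand-in)."""
    canonical = "\n".join(f"{v:.{digits - 1}e}" for v in values)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def file_checksum(raw):
    """Ordinary checksum over the raw file bytes."""
    return hashlib.sha256(raw).hexdigest()


data = [1.5, 2.25, 3.0]

# The same data written in two different "file formats".
as_csv = "\n".join(str(v) for v in data).encode("utf-8")
as_json = json.dumps(data).encode("utf-8")

# Reading each format back yields the same values.
read_csv = [float(s) for s in as_csv.decode("utf-8").splitlines()]
read_json = json.loads(as_json.decode("utf-8"))

print(file_checksum(as_csv) == file_checksum(as_json))
print(value_fingerprint(read_csv) == value_fingerprint(read_json))
```

     The byte-level checksums differ because the two encodings differ,
     while the value-based fingerprints agree because both files decode
     to the same numbers.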

     UNF libraries are available for standalone use, for use in C++,
     and for use with other packages.


     Returns a character string representing the UNF computed from the
     data. For example: UNF:3:6:ZNQRI14053UZq389x0Bffg==

     This representation identifies the signature as a fingerprint
     computed using version 3 of the algorithm to 6 significant
     digits. The segment following the final colon is the actual
     fingerprint in base64-encoded format.

     Note: to compare two UNFs, or sets of UNFs, one often wants to
     compare only the base64 portion. Use 'as.character' for this,
     which will extract the base64 portion. Use 'summary' to produce a
     single UNF from a set of vectors.
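
     When UNF strings are handled outside the statistical package, the
     colon-separated layout shown above can be split apart directly.
     The Python sketch below assumes every UNF string has the form
     UNF:version:digits:base64, as in the example; `parse_unf` and
     `same_fingerprint` are hypothetical helper names, not part of any
     UNF library.

```python
def parse_unf(text):
    """Split a UNF string into version, digits, and base64 fingerprint."""
    prefix, version, digits, fingerprint = text.strip().split(":", 3)
    if prefix != "UNF":
        raise ValueError("not a UNF string: " + text)
    return {
        "version": int(version),
        "digits": int(digits),
        "fingerprint": fingerprint,
    }


def same_fingerprint(a, b):
    """Compare only the base64 portions of two UNF strings."""
    return parse_unf(a)["fingerprint"] == parse_unf(b)["fingerprint"]


parsed = parse_unf("UNF:3:6:ZNQRI14053UZq389x0Bffg==")
print(parsed["version"], parsed["digits"], parsed["fingerprint"])
```

     Comparing only the base64 portion ignores the version and digits
     fields, so such a comparison is meaningful only when both UNFs
     were computed with the same settings.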

_Author_

     Micah Altman Micah_Altman@harvard.edu

     <URL: http://www.hmdc.harvard.edu/micah_altman/>

_References_

     Altman, M., J. Gill and M. P. McDonald.  2003.  _Numerical Issues
     in Statistical Computing for the Social Scientist_.  John Wiley &
     Sons. <URL: http://www.hmdc.harvard.edu/numerical_issues/>

