
 @menu
 * Using the Ferret Library
 * Portability
 * Error Handling
 * Utility Routines
 * Data Types
 * Vector and Vecset Distances
 * Tables
 * Multiple Modality Support
 * Vector Distance Reference
 * Vecset Distance Reference
 * Index and Sketch Reference
 @end menu

 @node Using the Ferret Library
 @section Using the Ferret Library
 After successful installation, add the following line to your code before the
 first reference to any Ferret Library data type or routine.
 @smallexample
 	#include <cass.h>
 @end smallexample
 To link against the Ferret Library, add the option @option{-lcass} to the
 command line of @command{ld} or @command{gcc}.

 @node Portability
 @section Portability

 For portability, the Ferret Library uses the following primitive types, which
 compile to the same size on all machines.

 @smallexample
 int32_t
 uint32_t
 uint64_t
 uchar
 @end smallexample

 The database files are portable between 32-bit and 64-bit machines, but not yet
 between machines of different endianness.  To make them portable across byte
 orders, only the file IO module (@file{cass_file.c}) needs to be modified.

 @node Error Handling
 @section Error Handling

 Most of the Ferret Library routines return 0 on success and a non-zero error
 code on failure.  The possible error codes are listed below.

 @smallexample
 CASS_ERR_OUTOFMEM
 CASS_ERR_MALFORMATVEC
 CASS_ERR_PARAMETER
 CASS_ERR_IO
 CASS_ERR_CORRUPTED
 @end smallexample

 The following function converts an error code into a string for display.
 @table @asis
 @item @emph{Prototype:}
 @code{const char* cass_strerror (int err);}
 @end table

 @node Utility Routines
 @section Utility Routines
 This section documents the general utility data structures and routines that are
 not particularly related to nearest neighbor search, but make programming
 easier.  You do not need to initialize the library to use the routines
 documented in this section.

 @subsection Dynamic arrays
 Maintaining a dynamic (growable) array is always a pain in the plain C
 language.  The Ferret Library provides a set of macros to ease the programmer's
 life.  A dynamic array of type @code{@emph{T}} is declared as the following struct:
 @smallexample
 	struct @{
 	        cass_size_t inc;
 	        cass_size_t size;
 	        cass_size_t len;
 	        @emph{T} *data;
 	@} @emph{array};
 @end smallexample
 where @code{len} is the number of elements actually in the array and @code{size} is
 the size of the memory allocated, pointed to by @code{data}.  When the array
 needs to grow, @code{size} is incremented by @code{inc}, and the memory is
 reallocated.  All @code{inc}, @code{size} and @code{len} are numbers of
 elements instead of actual bytes.  To access the @code{i}-th element of
 @code{@emph{array}}, use @code{@emph{array}.data[i]}.
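
 The growth policy described above can be sketched in plain C.  This is only an
 illustration under stated assumptions, not the library's macro implementation:
 the names @code{int_array_t}, @code{int_array_init}, @code{int_array_append} and
 @code{int_array_cleanup} are hypothetical, and the element type is fixed to
 @code{int} for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the struct described above, with T = int. */
typedef struct {
    size_t inc;   /* growth step, in elements */
    size_t size;  /* allocated capacity, in elements */
    size_t len;   /* number of elements in use */
    int *data;
} int_array_t;

/* Start empty; memory is allocated on the first append. */
static void int_array_init(int_array_t *a, size_t inc)
{
    assert(inc > 0);
    a->inc = inc;
    a->size = 0;
    a->len = 0;
    a->data = NULL;
}

/* Append one element, growing the capacity by inc when the array is full.
 * Returns 0 on success and -1 if realloc fails. */
static int int_array_append(int_array_t *a, int elem)
{
    if (a->len == a->size) {
        int *p = realloc(a->data, (a->size + a->inc) * sizeof *p);
        if (p == NULL)
            return -1;
        a->data = p;
        a->size += a->inc;
    }
    a->data[a->len++] = elem;
    return 0;
}

/* Release the element storage and reset the bookkeeping fields. */
static void int_array_cleanup(int_array_t *a)
{
    free(a->data);
    a->data = NULL;
    a->size = 0;
    a->len = 0;
}
```

 With an increment of 4, appending ten elements leaves @code{len} at 10 and
 @code{size} at 12, since the capacity grows in steps of @code{inc}.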

 Any structure declared following the above template can be handled by the macros
 documented in this section, regardless of the type @code{@emph{T}} and the actual
 name of the struct.  With the array declaration macro @code{ARRAY_TYPE}, the
 declaration of an array can be as simple as
 @smallexample
 	ARRAY_TYPE(@emph{T}) @emph{array};
 @end smallexample

 The following small example shows the operations of an array of @code{char}.

 @smallexample
 	ARRAY_TYPE(char) s;
 	ARRAY_INIT(s);

 	ARRAY_APPEND(s, 'A');
 	ARRAY_APPEND(s, 'B');
 	ARRAY_GET(s, 0);        /* should be 'A' */
 	ARRAY_LEN(s);           /* should be 2 */

 	/* enumerate the array */
 	ARRAY_BEGIN_FOREACH(s, c)
 	@{
 	        putchar(c);
 	@} ARRAY_END_FOREACH(s, c);

 	ARRAY_CLEANUP(s);
 @end smallexample

 Note the following points.
 @itemize @bullet
 @item All macros accept the array variable itself, not a pointer to it.
 @item The type of the array elements can be obtained with the GCC extension
 @code{typeof(@emph{array}.data[0])}.
 @item The details of the array struct are considered exposed, and any changes
 made to the struct are assumed safe, so long as the following holds:
 @smallexample
 	   len <= size
 	&& inc > 0
 	&& (size == 0 || data != NULL)
 @end smallexample
 @item The array struct can safely contain other fields, and they will not be
 touched by the macros.  To add extra fields to the struct, the user needs to
 declare the struct explicitly; the @code{ARRAY_TYPE} macro will not help.
 @item The user is responsible for managing any dynamically allocated memory
 pointed to by the array elements, should they be pointers.
 @end itemize

 @subsubsection Declaration
 @table @asis
 @item @emph{Macro}:
 @code{ARRAY_TYPE(type) @emph{array};}

 @code{@emph{array} = ARRAY_WRAPPER(data, len);}

 @item @emph{Description}:
 @code{ARRAY_TYPE} expands to the type declaration of a struct for a dynamic array of
 the given type.  @code{ARRAY_WRAPPER} makes an array variable out of a chunk of
 memory allocated by @code{malloc}, specified by the pointer to the first
 element, @code{data}, and the number of elements, @code{len}.

 @end table

 @subsubsection Initialization
 @table @asis
 @item @emph{Macro}:
 @code{ARRAY_INIT(array)}

 @code{ARRAY_INIT_SIZE(array, initial_size)}
 @item @emph{Description}:
 Initialize an array.  The latter sets the size of the array to @code{initial_size}
 and allocates memory of this size.
 @end table

 @subsubsection Cleanup
 @table @asis
 @item @emph{Macro}:
 @code{ARRAY_CLEANUP(array)}
 @item @emph{Description}:
 Release the dynamically allocated memory of the array.
 @end table

 @subsubsection Field access
 @table @asis
 @item @emph{Macro}:
 @code{ARRAY_INC(array)}

 @code{ARRAY_SET_INC(array, inc)}

 @code{ARRAY_LEN(array)}

 @code{ARRAY_RAW_SIZE(array)}

 @code{ARRAY_SIZE(array)}

 @code{ARRAY_EXPAND(array, new_size)}

 @item @emph{Description:}
 These macros access the various fields of the array struct.
 The macro @code{ARRAY_RAW_SIZE} returns the raw size of data in bytes actually
 present in the array, which is equal to @code{@emph{array}.len *
 sizeof(@emph{array}.data[0])}.

 @code{ARRAY_EXPAND} will grow the array if necessary so that the size of the array is at
 least @code{new_size}.

 @end table

 @subsubsection Element access
 @table @asis
 @item @emph{Macro:}
 @code{ARRAY_GET(array, i)}

 @code{ARRAY_SET(array, i, elem)}

 @code{ARRAY_APPEND(array, elem)}

 @code{ARRAY_APPEND_UNSAFE(array, elem)}

 @code{ARRAY_TRUNC(array)        /* same as array.len = 0 */}

 @code{ARRAY_MERGE(array, array2)}

 @code{ARRAY_MERGE_RAW(array, data, len)}

 @item @emph{Description:}
 For @code{ARRAY_GET} and @code{ARRAY_SET}, you need to ensure that @code{i <
 @emph{array}.len}.

 @code{ARRAY_APPEND} will automatically grow the array when
 space is not enough, but @code{ARRAY_APPEND_UNSAFE} will not.  The latter is
 faster though, and can be used when total size of the array is known and fixed
 at initialization.

 @code{ARRAY_MERGE} appends all the elements in @code{array2} to @code{array},
 growing @code{array} when necessary.  @code{ARRAY_MERGE_RAW} is similar, but the
 second array is given by a pointer to its first element and the number of
 elements.

 @end table

 @subsubsection Element enumeration
 @table @asis
 @item @emph{Macro:}
 @smallexample
 ARRAY_BEGIN_FOREACH(array, cursor) @{

       ...@emph{your code}...

 @} ARRAY_END_FOREACH(array, cursor);


 ARRAY_BEGIN_FOREACH_P(array, cursor_p) @{

       ...@emph{your code}...

 @} ARRAY_END_FOREACH_P(array, cursor_p);
 @end smallexample
 @item @emph{Description:}
 These two sets of macros enumerate the elements in an array.  @code{cursor(_p)} does
 not need to be declared outside the block (in fact, any declaration of
 @code{cursor(_p)} outside the block will be overridden).  The former pair of
 macros passes each element by value and the latter by reference, which can reduce
 the overhead of copying the elements when they are large.  You can use any name
 for @code{cursor(_p)} so long as it matches in the @code{..._BEGIN_...} and
 @code{..._END_...} macros.
 @end table

 @subsection Heaps

 The Ferret Library provides binary heap support on top of the dynamic array routines
 described in the previous section.  A heap is just a dynamic array, except that
 the elements in the array follow a certain order.  The heap routines described
 below operate directly on array structs.

 @table @asis
 @item @emph{Prototype:}
 @code{HEAP_EMPTY(heap)        /*boolean, whether the heap is empty */ }

 @code{HEAP_HEAD(heap)         /*the smallest element in the heap, in value */ }

 @code{HEAP_ENQUEUE(heap, elem, ge)}

 @code{HEAP_DEQUEUE(heap, ge)  /* does not return the element */ }

 @code{HEAP_ENQUEUE_UPDATE(heap, elem, ge, update)}

 @code{HEAP_DEQUEUE_UPDATE(heap, elem, ge, update)}

 @item @emph{Description:}
 Both @code{HEAP_ENQUEUE} and @code{HEAP_DEQUEUE} require an extra parameter
 @code{ge}, the comparator, which can be either a function pointer or a macro.
 For any two variables @code{a} and @code{b} of the heap element type,
 @code{ge(&a,&b)} should return 1 if @code{a >= b} and 0 otherwise.

 Sometimes the elements of the heap need to know their own index in the heap.
 This is complicated by the fact that the enqueue and dequeue operations
 reorder some of the heap elements.  The macros @code{HEAP_ENQUEUE_UPDATE} and
 @code{HEAP_DEQUEUE_UPDATE} allow the user to track the offset of heap elements.
 The extra parameter @code{update} can be a function pointer or a macro, and for
 each element @code{a} that is relocated to a new offset @code{i},
 @code{update(&a, i)} will be invoked.

 For @code{HEAP_DEQUEUE(_UPDATE)}, the head of the heap is thrown away silently.
 You need to call @code{HEAP_HEAD} in advance if you need to keep the dequeued
 element.

 All the heap macros described here assume the elements of the array are in
 binary heap order.  You can safely access the heap data with other array macros in
 a read-only manner, but updates other than @code{ARRAY_EXPAND} are not advised
 because they may destroy the heap order.

 @end table
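
 The interplay of the @code{ge} comparator and the @code{update} callback can be
 illustrated with a small standalone min-heap.  This is a sketch under stated
 assumptions, not the library's macros: @code{item_t}, @code{item_ge},
 @code{item_update}, @code{heap_enqueue} and @code{heap_dequeue} are hypothetical
 names, and the heap lives in a caller-provided array with enough capacity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical heap element that remembers its own offset in the heap. */
typedef struct {
    int key;
    size_t pos;   /* maintained only through item_update */
} item_t;

/* Comparator in the style described above: 1 if a >= b, 0 otherwise. */
static int item_ge(const item_t *a, const item_t *b)
{
    return a->key >= b->key;
}

/* Called for every element that lands on a new offset i. */
static void item_update(item_t *a, size_t i)
{
    a->pos = i;
}

/* Insert elem into heap[0..*len); the smallest key stays at heap[0]. */
static void heap_enqueue(item_t *heap, size_t *len, item_t elem)
{
    size_t i = (*len)++;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (item_ge(&elem, &heap[parent]))
            break;                      /* parent <= elem: order holds */
        heap[i] = heap[parent];         /* move the parent down */
        item_update(&heap[i], i);
        i = parent;
    }
    heap[i] = elem;
    item_update(&heap[i], i);
}

/* Drop the head silently; read heap[0] beforehand if it is needed. */
static void heap_dequeue(item_t *heap, size_t *len)
{
    assert(*len > 0);
    item_t last = heap[--(*len)];
    size_t n = *len, i = 0;
    if (n == 0)
        return;
    for (;;) {
        size_t child = 2 * i + 1;
        if (child >= n)
            break;
        if (child + 1 < n && !item_ge(&heap[child + 1], &heap[child]))
            child++;                    /* pick the smaller child */
        if (item_ge(&heap[child], &last))
            break;
        heap[i] = heap[child];          /* move the child up */
        item_update(&heap[i], i);
        i = child;
    }
    heap[i] = last;
    item_update(&heap[i], i);
}
```

 After every enqueue or dequeue, each element's @code{pos} field equals its
 current offset, which is the property the @code{_UPDATE} variants are meant to
 provide.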

 @subsection Maintaining top-K elements

 Maintaining a list of the best K structs with respect to a certain field is a
 common job, and the Ferret Toolkit provides a set of macros to do this
 efficiently.

 The macros documented below work on an array of @code{K} structs with a comparable
 field @code{key}.

 @table @asis
 @item @emph{Prototype:}
 @code{TOPK_INIT (array, key, K, init)}
 @item @emph{Description:}
 Initialize the @code{key} field of each of the @code{K} elements to the value
 @code{init}.  To keep the @code{K} maximal values, @code{init} should be the
 smallest possible value; to keep the minimal values, the largest.

 @item @emph{Prototype:}
 @code{TOPK_INSERT_MIN (array, key, K, elem)}
 @item @emph{Description:}
 Keep the @code{K} minimal elements in @code{array} with regard to the field
 @code{key}.  If @code{elem} is smaller than the existing largest element in
 @code{array}, it replaces that largest element.

 @item @emph{Prototype:}
 @code{TOPK_INSERT_MAX (array, key, K, elem)}
 @item @emph{Description:}
 Similar to @code{TOPK_INSERT_MIN}, except that it keeps the @code{K} maximal
 elements.

 @item @emph{Prototype:}
 @code{TOPK_SORT_MIN (array, key, K)}
 @item @emph{Description:}
 The above two macros use a heap for high performance, so the resulting array is
 not sorted according to @code{key}.  This macro sorts the array
 in descending order according to @code{key}.  Note that this macro can only sort
 arrays prepared with @code{TOPK_INSERT_MIN}.

 @item @emph{Prototype:}
 @code{TOPK_INSERT_MIN_UNIQ (array, key, K, elem)}
 @item @emph{Description:}
 Similar to @code{TOPK_INSERT_MIN}, except that it ensures that every element in
 @code{array} has a unique @code{index} field.  For now, the name of the field
 @code{index} is not customizable.  This is to be fixed in later versions of the
 toolkit.  To maintain uniqueness, naive array insertion is used instead of a
 heap, and the result is sorted.

 @end table

 Following is a small example.  Given the query point id @code{q} and the set of
 data point ids @code{data[]}, we want to get the @code{K} points in @code{data[]}
 that are closest to @code{q}.

 @smallexample
 	#define K 10

 	typedef struct @{
 	        float key;
 	        int index;
 	@} foo_t;

 	foo_t min[K], foo;

 	TOPK_INIT(min, key, K, MAXFLOAT);

 	for (i = 0; i < data_len; i++)
 	@{
 	       foo.key = dist(q, data[i]);
 	       foo.index = i;
 	       TOPK_INSERT_MIN(min, key, K, foo);
 	@}

 	/* sort the result*/
 	TOPK_SORT_MIN(min, key, K);
 @end smallexample
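
 To show why initializing every key to the largest value lets a heap keep the
 @code{K} smallest entries, here is a standalone sketch.  It assumes a max-heap
 rooted at index 0, which is one plausible way to implement such macros and not
 necessarily how the toolkit does it; @code{entry_t}, @code{topk_init} and
 @code{topk_insert_min} are hypothetical names.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

typedef struct {
    float key;
    int index;
} entry_t;

/* Fill all k slots with the sentinel key; equal keys form a valid heap. */
static void topk_init(entry_t *a, size_t k, float init)
{
    for (size_t i = 0; i < k; i++) {
        a[i].key = init;
        a[i].index = -1;
    }
}

/* Keep the k smallest keys seen so far.  a[0] always holds the largest
 * key that is still kept, so a single comparison rejects most candidates. */
static void topk_insert_min(entry_t *a, size_t k, entry_t elem)
{
    assert(a != NULL);
    if (k == 0 || !(elem.key < a[0].key))
        return;                         /* not better than the current worst */
    size_t i = 0;
    for (;;) {
        size_t child = 2 * i + 1;
        if (child >= k)
            break;
        if (child + 1 < k && a[child + 1].key > a[child].key)
            child++;                    /* pick the larger child */
        if (a[child].key <= elem.key)
            break;
        a[i] = a[child];                /* move the larger child up */
        i = child;
    }
    a[i] = elem;                        /* the old root has been overwritten */
}
```

 The array is left in heap order rather than sorted order, which is why a
 separate sorting step is needed afterwards.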

 @subsection Portable file IO

 @table @asis
 @item @emph{Prototype:}
 @code{typedef struct ... CASS_FILE;}

 @code{CASS_FILE *cass_open (const char *path, const char *mode);}

 @code{void cass_close (CASS_FILE *);}

 @code{/* @emph{type} can be one of @{ int32, uint32, uint64, size, float,
 	 double, char @} */}

 @code{int cass_read_@emph{type} (@emph{type} *, size_t nmemb, CASS_FILE *);}

 @code{int cass_write_@emph{type} (const @emph{type} *, size_t nmemb, CASS_FILE *);}

 @code{char *cass_read_pchar (CASS_FILE *);}

 @code{int cass_write_pchar (const char *, CASS_FILE *);}

 @item @emph{Description:}
 These routines are intended to be a portable way of doing I/O, so that a file
 produced on a machine of one architecture can be read by a machine of another
 architecture.  For now, the portability is not realized and these routines are
 just a wrapper around stdio.  @code{CASS_FILE} is the same as @code{FILE} for now.
 @end table
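
 As an illustration of what byte-order independent I/O could look like, the
 following sketch always stores a 32-bit value least significant byte first.
 It is not part of the Ferret Library; @code{write_u32_le} and
 @code{read_u32_le} are hypothetical helpers built directly on stdio, and the
 little-endian file layout is an arbitrary choice made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write v as four bytes, least significant first, whatever the host order.
 * Returns 0 on success and -1 on a short write. */
static int write_u32_le(uint32_t v, FILE *f)
{
    unsigned char b[4];
    assert(f != NULL);
    for (int i = 0; i < 4; i++)
        b[i] = (unsigned char)(v >> (8 * i));
    return fwrite(b, 1, 4, f) == 4 ? 0 : -1;
}

/* Read four bytes written by write_u32_le and rebuild the value.
 * Returns 0 on success and -1 on a short read. */
static int read_u32_le(uint32_t *v, FILE *f)
{
    unsigned char b[4];
    assert(f != NULL && v != NULL);
    if (fread(b, 1, 4, f) != 4)
        return -1;
    *v = 0;
    for (int i = 0; i < 4; i++)
        *v |= (uint32_t)b[i] << (8 * i);
    return 0;
}
```

 Because the bytes are produced with shifts instead of by copying the in-memory
 representation, a file written this way reads back identically on big-endian
 and little-endian hosts.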

 @subsection Timer
 @table @asis
 @item @emph{Prototype:}
 @code{typedef struct ... stimer_t;}

 @code{void stimer_tick (stimer_t *timer);}

 @code{float stimer_tuck (stimer_t *timer, const char *msg);}

 @item @emph{Description:}
 These routines measure the wall-clock time used by a piece of code.
 @code{stimer_tick} initializes the @code{timer} struct and records the start time;
 @code{stimer_tuck} stops timing and returns the elapsed time.  If @code{msg} is
 not @code{NULL}, the elapsed time is printed to @code{stdout} together with
 @code{msg}.

 @end table
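
 The tick/tuck pattern can be sketched with the C11 function
 @code{timespec_get}.  The names @code{wall_timer_t}, @code{wall_timer_tick} and
 @code{wall_timer_tuck} are hypothetical stand-ins, and reporting the result in
 seconds is an assumption; the unit used by the library's timer is not stated
 in this manual.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical timer: just remembers when the measurement started. */
typedef struct {
    struct timespec start;
} wall_timer_t;

/* Record the start time. */
static void wall_timer_tick(wall_timer_t *t)
{
    assert(t != NULL);
    timespec_get(&t->start, TIME_UTC);
}

/* Return the seconds elapsed since the matching tick; print them
 * together with msg when msg is not NULL. */
static float wall_timer_tuck(const wall_timer_t *t, const char *msg)
{
    struct timespec now;
    assert(t != NULL);
    timespec_get(&now, TIME_UTC);
    double secs = (double)(now.tv_sec - t->start.tv_sec)
                + (double)(now.tv_nsec - t->start.tv_nsec) * 1e-9;
    if (msg != NULL)
        printf("%s: %.6f s\n", msg, secs);
    return (float)secs;
}
```

 A measurement then brackets the code of interest with one call to each
 function, passing @code{NULL} as the message when no output is wanted.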

 @subsection Replacement of @code{malloc}, @code{calloc} and @code{realloc}
 @table @asis
 @item @emph{Prototype:}
 	@code{@emph{type *} type_alloc (@emph{type});}

 	@code{@emph{type *} type_calloc (@emph{type}, cass_size_t len);}

 	@code{@emph{type *} type_realloc (@emph{type}, @emph{type} *ptr, cass_size_t len);}
 @item @emph{Description:}
 These macros are just wrappers around @code{malloc}, @code{calloc} and
 @code{realloc}.  They accept @code{@emph{type}} instead of
 @code{sizeof(@emph{type})}, and return a pointer of type @code{@emph{type *}}
 instead of @code{void *}.
 @end table
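
 Wrappers of this kind are typically one-line macros.  The sketch below shows
 one way they could be written; the upper-case names @code{TYPED_ALLOC},
 @code{TYPED_CALLOC} and @code{TYPED_REALLOC} and the helper
 @code{make_weights} are hypothetical and are not the library's definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical typed wrappers: take a type name, return a typed pointer. */
#define TYPED_ALLOC(type)             ((type *)malloc(sizeof(type)))
#define TYPED_CALLOC(type, len)       ((type *)calloc((len), sizeof(type)))
#define TYPED_REALLOC(type, ptr, len) ((type *)realloc((ptr), (len) * sizeof(type)))

/* Example use: build a zeroed buffer of n floats, then enlarge it to
 * hold grown floats.  Returns NULL if either allocation fails. */
static float *make_weights(size_t n, size_t grown)
{
    assert(n > 0 && grown >= n);
    float *w = TYPED_CALLOC(float, n);
    if (w == NULL)
        return NULL;
    float *bigger = TYPED_REALLOC(float, w, grown);
    if (bigger == NULL) {
        free(w);            /* realloc leaves the old block intact on failure */
        return NULL;
    }
    return bigger;
}
```

 The cast inside each macro is what lets the compiler flag an assignment to a
 pointer of the wrong element type.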

 @subsection Vectors and Matrices
 A two-dimensional array, or matrix, of size @code{M*N} is stored in memory in the following way.
 First, a chunk of memory to hold @code{M*N} elements is allocated.  Then a
 chunk of memory to hold @code{M} pointers to the rows is allocated, and
 each pointer is initialized to the start address of its row.  @code{matrix3} is the
 3-dimensional array.
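
 The two-allocation layout can be sketched as follows for @code{float}
 elements.  The functions @code{float_matrix_alloc} and
 @code{float_matrix_free} are hypothetical illustrations of the layout, not the
 library's routines.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate rows*cols zeroed floats in one block, plus a table of row
 * pointers into that block.  Returns NULL on allocation failure. */
static float **float_matrix_alloc(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    float *block = calloc(rows * cols, sizeof *block);
    float **m = malloc(rows * sizeof *m);
    if (block == NULL || m == NULL) {
        free(block);
        free(m);
        return NULL;
    }
    for (size_t r = 0; r < rows; r++)
        m[r] = block + r * cols;    /* start address of row r */
    return m;
}

/* m[0] is the start of the element block, so two frees release everything. */
static void float_matrix_free(float **m)
{
    if (m != NULL) {
        free(m[0]);
        free(m);
    }
}
```

 Elements are addressed as @code{m[r][c]}, while the data itself stays
 contiguous, which keeps whole-matrix reads and writes simple.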

 @table @asis
 @item @emph{Prototype:}

 	@code{@emph{type **} type_matrix_alloc (@emph{type}, cass_size_t row,
 			cass_size_t col);}

 	@code{void matrix_free (@emph{type} **matrix);}

 	@code{@emph{type ***} type_matrix3_alloc (@emph{type}, cass_size_t num,
 			cass_size_t row, cass_size_t col);}

 	@code{void matrix3_free (@emph{type} ***matrix);}
 @item @emph{Description:}
 	Allocate/free memory for the matrix.


 @item @emph{Prototype:}
 	@code{int type_matrix_load_stream (@emph{type}, FILE *stream,
 			cass_size_t *row, cass_size_t *col, @emph{type}
 			***matrix);}

 	@code{int type_matrix_load_file (@emph{type}, const char *path,
 			cass_size_t *row, cass_size_t *col, @emph{type}
 			***matrix);}

 	@code{int type_matrix_dump_stream (@emph{type}, FILE *stream,
 			cass_size_t row, cass_size_t col, @emph{type}
 			**matrix);}

 	@code{int type_matrix_dump_file (@emph{type}, const char *path,
 			cass_size_t row, cass_size_t col, @emph{type}
 			**matrix);}
 @item @emph{Description:}
 	Read and write matrix, either with a stream or a file.  These routines
 	are not portable and are to be modified to use @code{CASS_FILE} instead.

 @item @emph{Prototype:}
 	@code{int type_matrix_map_file (@emph{type}, const char *path,
 			cass_size_t *row, cass_size_t *col, @emph{type}
 			***matrix);}

 	@code{int matrix_unmap_file (void **matrix);}
 @item @emph{Description:}
 	Memory mapping, similar to @code{type_matrix_load_file}.
 @end table

 @node Library Initialization and Cleanup
 @section Library Initialization and Cleanup

 @table @asis
 @item @emph{Prototype:}
 @code{int cass_init (void);}

 @code{int cass_cleanup (void);}
 @item @emph{Description:}
 Call @code{cass_init} before using any of the Ferret Library routines, and call
 @code{cass_cleanup} at the end of your program.
 @end table

 @node Database Environment
 @section Database Environment

 @subsection Open and close
 @table @asis
 @item @emph{Prototype:}
 @code{int cass_env_open (cass_env_t **env, char *db_home, uint32_t flags);}

 @code{int cass_env_close(cass_env_t *env, uint32_t flags);}

 @item @emph{Description:}
 The function @code{cass_env_open} opens a Ferret database for future
 use and returns the environment pointer through the first argument.
 @code{db_home} should contain the path to the directory that contains the
 Ferret database.  @code{flags} can be one or more of the following flags.
 @table @code
 @item CASS_READONLY
 Open the database read-only.  Without this flag specified, the
 database will be opened read-write.
 @item CASS_EXCL
 If the directory does not exist, create the database; otherwise
 fail.
 @end table

 The function @code{cass_env_close} closes the database provided as
 the first argument.  Currently no flags are supported, and the user should
 always pass in 0.

 @end table


 @subsection Logging
 @table @asis
 @item @emph{Prototype:}
 @code{int cass_env_errmsg(cass_env_t *env, int error, const char *fmt, ...);}

 @code{int cass_env_panic(cass_env_t *env, const char *msg);}
 @item @emph{Description:}
 The function @code{cass_env_errmsg} works like @code{fprintf}, except it writes
 the message to the database log file.

 The function @code{cass_env_panic} kills the application after printing out the
 error message.
 @end table

 @subsection Checkpointing
 @table @asis
 @item @emph{Prototype:}

 @code{int cass_env_checkpoint(cass_env_t *env);}
 @item @emph{Description:}
 Flush any changes made to the database since the last checkpoint to disk.  This
 function may lock the database and significantly delay pending
 modifications.
 @end table

 @subsection Describing
 @table @asis
 @item @emph{Prototype:}
 @code{int cass_env_describe (cass_env_t *env, CASS_FILE *);}

 @item @emph{Description:}
 This function writes the textual description of the database to the provided
 stream.
 @end table

 @node Mapping
 @section Mapping

 @node Vector, Vecset and Dataset
 @section Vector, Vecset and Dataset

 @subsection  Data structures
 A vecset is a set of vectors, and a dataset is a set of vecsets.  A vecset
 usually contains the feature vectors corresponding to an object that is
 segmented into several parts, with each part described by a vector in the vecset.
 A Ferret table is a dataset, which contains multiple vecsets.  The three
 structures are defined as follows.

 @smallexample
 /* vector */
 typedef struct @{
         float weight;
         cass_vecset_id_t parent;
         union @{
                 uchar	data[0];
                 int32_t	int_data[0];
                 float	float_data[0];
                 chunk_t	bit_data[0];
         @};
 @} cass_vec_t;

 /* vecset */
 typedef struct @{
         uint32_t num_regions;
         cass_vec_id_t start_vecid;
 @} cass_vecset_t;

 /* dataset */
 typedef struct @{
         uint32_t 		flags;
         uint32_t		loaded;
         cass_size_t             vec_size;
         cass_size_t             vec_dim;
         cass_vec_id_t           max_vec;
         cass_vec_id_t           num_vec;
         void                    *vec;
         cass_vecset_id_t        max_vecset;
         cass_vecset_id_t        num_vecset;
         cass_vecset_t           *vecset;
 @} cass_dataset_t;
 @end smallexample

 @code{cass_vec_t} and @code{cass_vecset_t} are not standalone; they depend on
 information in @code{cass_dataset_t} and even in
 @code{cass_vecset_cfg} (@pxref{x-config}) to determine how to interpret their
 fields.  The fields of the three structs are explained below.

 First, @code{cass_dataset_t}.
 @table @code
 @item flags
 A table can contain either vectors or vecsets, or both, and this field specifies
 what is contained in the dataset.  It can be one of the following values.
 @table @code
 @item CASS_DATASET_VEC
 The dataset contains only vector data.
 @item CASS_DATASET_VECSET
 The dataset contains only vecset data.
 @item CASS_DATASET_BOTH
 The dataset contains both vectors and vecsets.
 @end table
 @item loaded
 Whether the feature data (vectors and/or vecsets) are in memory and can be
 accessed through @code{vec}/@code{vecset} field.
 @item vec_size
 Size of a vector in bytes, including a 32-bit weight and the real feature data.
 @item vec_dim
 The dimension of the feature vector.
 @item max_vec
 The maximum number of vectors the memory allocated in @code{vec} can hold.
 @item num_vec
 The real number of vectors in the dataset.
 @item vec
 Pointer to the memory holding the vectors.
 @item max_vecset
 The maximum number of vecsets the memory allocated in @code{vecset} can hold.
 @item num_vecset
 The real number of vecsets in the dataset.
 @item vecset
 Pointer to the memory holding the vecsets.
 @end table
 Second, @code{cass_vecset_t}.
 @table @code
 @item num_regions
 Number of vectors in this vecset.
 @item start_vecid
 The offset of the first vector in the dataset.  The vectors in the dataset that
 belong to this vecset are those numbered from @code{start_vecid} to
 @code{(start_vecid + num_regions - 1)}.  These numbers are only meaningful to
 the dataset that contains the vecset.
 @end table
 Finally, @code{cass_vec_t}.
 @table @code
 @item weight
 The weight of the vector inside the vecset.
 @item parent
 The id of the vecset which contains the vector.
 @item [int_|float_|bit_]data
 This is the real data of the vector.
 A different field of the union should be
 used according to the configuration of the vecset.  Also, because the
 dimension is variable, these fields work as placeholders.
 @end table

 @code{cass_vec_t} requires a bit more explanation.  This struct actually works
 as a placeholder, and once the dimensionality of the vector is determined, the
 real data will always be larger than the struct in size.  In other words,
 @code{sizeof(cass_vec_t)} should never be used except for taking the size
 of the vector without the real feature data.  The variable size of
 @code{cass_vec_t} makes it a little tricky to locate a particular vector in a
 dataset.  The following piece of code illustrates how to do the job.

 @smallexample
 	/* Get the i-th vector in the dataset */
 	cass_vec_t *vec = (cass_vec_t *)((char *)dataset->vec + dataset->vec_size * i);
 @end smallexample
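
 The same stride arithmetic can be tried out in a standalone program.  The
 sketch below uses reduced, hypothetical stand-ins (@code{vec_t},
 @code{dataset_t}, @code{dataset_alloc}, @code{dataset_vec}) with only
 @code{float} feature data and a C99 flexible array member; it is not the
 library's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vector: fixed header, variable-length data. */
typedef struct {
    float weight;
    unsigned parent;
    float data[];          /* C99 flexible array member */
} vec_t;

/* Hypothetical stand-in for the dataset fields used in addressing. */
typedef struct {
    size_t vec_size;       /* bytes per vector: header plus vec_dim floats */
    size_t vec_dim;
    size_t num_vec;
    void *vec;
} dataset_t;

/* Allocate zeroed storage for n vectors of dimension dim. */
static int dataset_alloc(dataset_t *ds, size_t dim, size_t n)
{
    ds->vec_dim = dim;
    ds->vec_size = sizeof(vec_t) + dim * sizeof(float);
    ds->num_vec = n;
    ds->vec = calloc(n, ds->vec_size);
    return ds->vec != NULL ? 0 : -1;
}

/* Address of the i-th vector: step by vec_size bytes, not sizeof(vec_t). */
static vec_t *dataset_vec(const dataset_t *ds, size_t i)
{
    assert(i < ds->num_vec);
    return (vec_t *)((char *)ds->vec + ds->vec_size * i);
}
```

 Indexing the storage as an ordinary @code{vec_t} array would step by the
 header size only and land inside the feature data of an earlier vector.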

 @subsection Dataset routines
 The definitions of all the structures above are considered exposed, and
 they are safe to be modified by the user.  The following routines make common
 operations on datasets easier.

 @table @asis
 @item @emph{Prototype:}
 @code{int cass_dataset_init (cass_dataset_t *ds, cass_size_t vec_size,
 		cass_size_t vec_dim, uint32_t flags);}
 @item @emph{Description:}
 Initialize a dataset, without allocating memory to hold the feature data.

 @item @emph{Prototype:}
 @code{int cass_dataset_grow (cass_dataset_t *ds, cass_size_t num_vecset,
 		cass_size_t num_vec); }
 @item @emph{Description:}
 (Re-)allocate memory of the dataset so that it can hold at least
 	@code{num_vecset} vecsets and @code{num_vec} vectors.

 @item @emph{Prototype:}
 @code{int cass_dataset_release (cass_dataset_t *ds);}
 @item @emph{Description:}
 Release the memory of the feature data.  This does not free the @code{ds}
 pointer itself.

 @item @emph{Prototype:}
 @code{int cass_dataset_merge (cass_dataset_t *ds, const cass_dataset_t *src,
 		cass_vecset_id_t start, cass_vecset_id_t num, cass_dataset_map_t
 		map, void *map_param);}
 @item @emph{Description:}
 Merge @code{num} vecsets, starting from @code{start}, of the dataset @code{src}
 into the dataset @code{ds}, growing @code{ds} when necessary.  If the dataset
 contains both vectors and vecsets, the @code{parent} fields of the vectors are
 updated according to the new positions of the corresponding vecsets.
 However, if the dataset contains vectors only, the @code{parent} field is
 copied directly.

 @item @emph{Prototype:}
 @code{int cass_dataset_checkpoint (cass_dataset_t *ds, CASS_FILE *);}
 @code{int cass_dataset_restore (cass_dataset_t *ds, CASS_FILE *);}
 @item @emph{Description:}
 Read/write the meta data.

 @item @emph{Prototype:}
 @code{int cass_dataset_load (cass_dataset_t *ds, CASS_FILE *in);}
 @code{int cass_dataset_dump (cass_dataset_t *ds, CASS_FILE *out);}
 @item @emph{Description:}
 Read/write the feature data.


 @end table

 @node Class, Instance and Registry
 @section Class, Instance and Registry
 The Ferret Toolkit is designed such that many components work as plugins and can
 be customized.  For example, there are different index/sketch plugins that
 meet different requirements of precision, speed and storage overhead.  One
 specific index algorithm can be viewed as a class, and it is instantiated when a
 real index structure is created upon a dataset.  The vecset distance and vector
 distance also follow this paradigm.  For example, an L2 distance algorithm that
 evaluates only a certain subset of all the dimensions is a class, and it is
 instantiated when the specific subset of dimensions to use is
 provided.  A class is defined with a number of properties and a set of methods,
 and an instance of a class also has a set of properties and a pointer to
 the class.  Following is an example showing the structs describing the vector distance
 class and vector distance instances.

 @smallexample
 typedef cass_dist_t (*cass_vec_dist_func_t) (cass_size_t n, void *, void *, void *);

 /* this is the class */
 typedef struct @{
         char                   *name;
         cass_vec_type_t         vec_type;
         cass_vec_dist_type_t    type;
         cass_vec_dist_func_t    dist;
         int (*describe) (void *, CASS_FILE *);
         int (*construct) (cass_vec_dist_t **, const char *);
         int (*checkpoint) (void *, CASS_FILE *);
         int (*restore) (void **, CASS_FILE *);
         void (*free) (void *);
 @} cass_vec_dist_class_t;

 /* this is the instance */
 typedef struct @{
         uint32_t        refcnt;
         char           *name;
         cass_vec_dist_class_t *class;
         /* private data... */
 @} cass_vec_dist_t;

 @end smallexample

 Given a struct @code{dist_class} of the class, an instance can be created with
 @code{dist_class->construct}, which returns the constructed instance through
 its first argument.  Because different classes accept different parameters, a
 string is used to encode the parameters.  The parameter string could be
 something like @code{"-M 128 -W 3.2"}.

 @xref{x-extend,,Extending Ferret}, for more information.

 Both classes and instances are identified by an internal ID and an external textual name,
 and a registry is used to map between the ID, the name and the pointer to the struct.
 Each type of class/instance has its own namespace, and the namespaces of classes
 and instances are separate.  For example, you can create a distance instance
 "trivial" from the distance class "trivial".

 Following are the registry routines.

 @table @asis
 @item @emph{Prototype:}
 @code{typedef struct ... cass_reg_t;}
 @item @emph{Description:}
 This is the registry type.

 @item @emph{Prototype:}
 @code{int cass_reg_init (cass_reg_t *reg);}

 @code{int cass_reg_init_size (cass_reg_t *reg, cass_size_t size);}
 @item @emph{Description:}
 Initialize a registry.  The latter initializes the registry to hold @code{size}
 entries, and is used when the size of the registry is known and fixed in advance.

 @item @emph{Prototype:}
 @code{int cass_reg_cleanup (cass_reg_t *reg);}
 @item @emph{Description:}
 Clean up the registry and release its memory.

 @item @emph{Prototype:}
 @code{int32_t cass_reg_lookup (cass_reg_t *, const char *name);}
 @item @emph{Description:}
 Map name to ID.  If the name does not exist, -1 is returned.

 @item @emph{Prototype:}
 @code{int32_t cass_reg_find (cass_reg_t *, const void *ptr);}
 @item @emph{Description:}
 Map pointer to ID.  If the pointer does not exist in the registry, -1 is
 	returned.

 @item @emph{Prototype:}
 @code{void *cass_reg_get (cass_reg_t *, uint32_t i);}
 @item @emph{Description:}
 Map ID to pointer.

 @item @emph{Prototype:}
 @code{int cass_reg_add (cass_reg_t *, const char *name, void *);}
 @item @emph{Description:}
 Add an item to the registry and return the ID of the newly added item.

 @end table
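
 The three mappings a registry provides (name to ID, pointer to ID, and ID to
 pointer) can be illustrated with a small fixed-capacity table.  This is a
 sketch under stated assumptions and not the library's implementation:
 @code{registry_t}, @code{REG_MAX} and the @code{registry_*} functions are
 hypothetical names, and lookups are plain linear scans.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REG_MAX 16   /* hypothetical fixed capacity */

/* An entry's ID is simply its position in the two parallel arrays. */
typedef struct {
    const char *names[REG_MAX];
    void *ptrs[REG_MAX];
    uint32_t count;
} registry_t;

static void registry_init(registry_t *r)
{
    assert(r != NULL);
    r->count = 0;
}

/* Add an entry and return its ID, or -1 when the table is full.
 * The name string is borrowed, not copied. */
static int32_t registry_add(registry_t *r, const char *name, void *p)
{
    if (r->count == REG_MAX)
        return -1;
    r->names[r->count] = name;
    r->ptrs[r->count] = p;
    return (int32_t)r->count++;
}

/* Map name to ID; -1 if the name is unknown. */
static int32_t registry_lookup(const registry_t *r, const char *name)
{
    for (uint32_t i = 0; i < r->count; i++)
        if (strcmp(r->names[i], name) == 0)
            return (int32_t)i;
    return -1;
}

/* Map pointer to ID; -1 if the pointer is not registered. */
static int32_t registry_find(const registry_t *r, const void *p)
{
    for (uint32_t i = 0; i < r->count; i++)
        if (r->ptrs[i] == p)
            return (int32_t)i;
    return -1;
}

/* Map ID to pointer; NULL if the ID is out of range. */
static void *registry_get(const registry_t *r, uint32_t id)
{
    return id < r->count ? r->ptrs[id] : NULL;
}
```

 Keeping one registry per kind of object gives each kind its own namespace, so
 a class and an instance may share a name without clashing.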

 @node Vector Distance and Vecset Distance
 @section Vector Distance and Vecset Distance

 To enable customization, the vector distance and vecset distance come with a
 parameter instead of simply a pointer to a function, and the parameter and the
 function pointer are wrapped up in a struct.  Both vector and vecset distances
 follow the class/instance model.

 Following are the declarations related to vector distance.
 @smallexample
 typedef cass_dist_t (*cass_vec_dist_func_t) (cass_size_t n, void *v1, void *v2,
 		cass_vec_dist_t *);

 typedef struct _cass_vec_dist_class @{
         char			*name;
         cass_vec_type_t          vec_type;
         cass_vec_dist_type_t     type;
         cass_vec_dist_func_t     dist;

         int (*describe) (void *, CASS_FILE *);
         int (*construct) (void **, const char *param);
         int (*checkpoint) (void *, CASS_FILE *);
         int (*restore) (void **, CASS_FILE *);
         void (*free) (void *);
 @} cass_vec_dist_class_t;

 typedef struct @{
         uint32_t refcnt;
         char *name;
         cass_vec_dist_class_t *class;
         /* private data... */
 @} cass_vec_dist_t;

 @end smallexample

 Some of the fields of @code{cass_vec_dist_class_t} are explained
 below.
 @table @code
 @item vec_type
 The type of vector on which this distance is defined.
 @item type
 Type of the distance.  Can be one of the following values.
 @table @code
 @item    CASS_VEC_DIST_TYPE_TRIVIAL
 @item    CASS_VEC_DIST_TYPE_ID
 @item    CASS_VEC_DIST_TYPE_L1
 @item    CASS_VEC_DIST_TYPE_L2
 @item    CASS_VEC_DIST_TYPE_MAX
 @item    CASS_VEC_DIST_TYPE_HAMMING
 @item    CASS_VEC_DIST_TYPE_COS
 @end table
 Note that the distance type itself does not determine the distance.  For example,
 an L1 distance evaluated using all the dimensions and one evaluated using a
 subset of the dimensions are both of the type @code{CASS_VEC_DIST_TYPE_L1}.

 @item dist
 The pointer to the function.
 @item describe
 Write a textual description to the stream.
 @item construct
 Construct the distance instance using the given textual parameter.
 @item checkpoint
 Write the meta data of distance into a stream.
 @item restore
 Construct a distance instance from a stream.
 @item free
 Free the distance instance.
 @end table

 The fields of @code{cass_vec_dist_t} need no further explanation.  One
 thing to note is that @code{cass_vec_dist_t} is a base type, and for different
 distance classes, there can be more specific types that hold private
 information.  For example, if only a certain subset of the dimensions is
 of interest, then the distance can be defined as follows:

 @smallexample
 	typedef struct
 	@{
 		cass_vec_dist_t base;
 		ARRAY_TYPE(int) dim; /* array of interested dimensions */
 	@} cass_l1_dim_t;
 @end smallexample
 One can then safely cast @code{cass_l1_dim_t *} to @code{cass_vec_dist_t *}.
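
 The cast is safe because @code{base} is the first member of the struct, so a
 pointer to the struct and a pointer to its first member refer to the same
 address.  The following is a minimal, self-contained sketch of the pattern.
 The types @code{demo_dist_t} and @code{demo_l1_dim_t} and both functions are
 hypothetical stand-ins, not part of the Ferret Library, and a plain pointer
 with a count is used in place of @code{ARRAY_TYPE(int)}.

 @smallexample
 #include <stddef.h>
 #include <stdint.h>

 /* Hypothetical stand-in for cass_vec_dist_t (class pointer omitted). */
 typedef struct @{
         uint32_t refcnt;
         const char *name;
 @} demo_dist_t;

 /* Hypothetical specialized distance: the base type comes first. */
 typedef struct @{
         demo_dist_t base;
         const int *dim;     /* dimensions of interest */
         size_t ndim;        /* number of entries in dim */
 @} demo_l1_dim_t;

 /* Generic code sees only the base type. */
 static const char *demo_dist_name (const demo_dist_t *d)
 @{
         return d->name;
 @}

 /* L1 distance restricted to the selected dimensions. */
 static float demo_l1_dim_eval (const demo_dist_t *d,
                 const float *a, const float *b)
 @{
         /* Valid only if d really points into a demo_l1_dim_t. */
         const demo_l1_dim_t *self = (const demo_l1_dim_t *)d;
         float sum = 0.0f;
         size_t i;
         for (i = 0; i < self->ndim; i++) @{
                 float diff = a[self->dim[i]] - b[self->dim[i]];
                 sum += (diff < 0) ? -diff : diff;
         @}
         return sum;
 @}
 @end smallexample

 The downcast inside @code{demo_l1_dim_eval} is only correct when the instance
 was created as a @code{demo_l1_dim_t}; in the library this is presumably
 guaranteed by reaching the function through the instance's own class.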


 The following routines access the vector distance classes.  They are simply
 specialized versions of the registry routines, applied to the vector distance
 class registry.

 @table @asis
 @item @emph{Prototype:}
 @code{int32_t cass_vec_dist_class_lookup (const char *n);}

 @code{int32_t cass_vec_dist_class_find (const cass_vec_dist_class_t *p);}

 @code{cass_vec_dist_class_t *cass_vec_dist_class_get (uint32_t i);}
 @end table

 The declaration of vecset distance is similar.
 @smallexample
 typedef cass_dist_t (*cass_vecset_dist_func_t) (cass_dataset_t *, cass_vecset_id_t,
 		cass_dataset_t *, cass_vecset_id_t , cass_vec_dist_t *vec_dist,
 		cass_vecset_dist_t *);

 typedef struct _cass_vecset_dist_class @{
 	char *name;
 	cass_vecset_type_t vecset_type;
 	cass_vecset_dist_type_t type;
 	cass_vecset_dist_func_t dist;
 	/* void ** is actually cass_vecset_dist_t ** */
 	int (*describe) (void *, CASS_FILE *);
 	int (*construct) (void **, const char *param);
 	int (*checkpoint) (void *, CASS_FILE *);
 	int (*restore) (void **, CASS_FILE *);
 	void (*free) (void *);
 	/* private data... */
 @} cass_vecset_dist_class_t;

 typedef struct @{
 	uint32_t refcnt;
 	char *name;
 	int32_t vec_dist;
 	cass_vecset_dist_class_t *class;
 @} cass_vecset_dist_t;

 @end smallexample


 There are also routines to access the vecset distance class registry.
 @table @asis
 @item @emph{Prototype:}
 @code{int32_t cass_vecset_dist_class_lookup (const char *n);}

 @code{int32_t cass_vecset_dist_class_find (const cass_vecset_dist_class_t *p);}

 @code{cass_vecset_dist_class_t *cass_vecset_dist_class_get (uint32_t i);}
 @end table
 @node Configurations and Tables
 @section Configurations and Tables
 @anchor{x-config}

 Just as a table in a relational database has a schema, a table in a Ferret
 database has a configuration, which is specified by the user when the table is
 created.  The configuration struct is defined as follows.

 @smallexample

 typedef struct @{
         uint32_t                refcnt;
         char                   *name;
         cass_vecset_type_t      vecset_type;
         cass_vec_type_t         vec_type;
         cass_size_t             vec_dim;
         cass_size_t             vec_size;
         uint32_t                flags;
 @} cass_vecset_cfg_t;

 @end smallexample

 The fields are explained below.

 @table @code
 @item refcnt
 The reference count of the structure.  Unlike a schema in a relational
 database, which is always associated with a specific table, a configuration in a
 Ferret database is an independent object and can be shared among multiple
 tables.  The reference count tracks the number of tables, or other objects,
 that keep a pointer to the configuration struct.
 @item name
 The name of the configuration.  The memory of the string is owned by the struct,
 and will be freed when the configuration is destroyed.
 @item vecset_type
 Type of the vecset.  Can be one of the following values:
 @table @code
 @item CASS_VECSET_SINGLE
 Every vecset in the table has exactly one vector.
 @item CASS_VECSET_SET
 A vecset can have more than one vector.  This is the most common case.
 @item CASS_VECSET_NONE
 The table does not keep data for vecsets, only for vectors.  This is
 used for sketch methods which disregard the vecset and only deal with vectors.
 @end table
 @item vec_type
 The type of the vector.  Can be one of the following values:
 @table @code
 @item CASS_VEC_INT
 32-bit integer, or @code{int32_t}.
 @item CASS_VEC_FLOAT
 32-bit floating point number, or @code{float}.
 @item CASS_VEC_BIT
 The feature vector is a bit vector.
 @item CASS_VEC_QUANT
 The feature vector is some kind of quantization.  Used by certain index methods.
 @end table
 @item vec_dim
 The dimension of the feature vector.  For bit vectors, it means the number of
 bits.
 @item vec_size
 The size of the feature vector, which includes the feature vector itself and a
 weight.
 @item flags
 For now, always set this to 0.
 @end table

 @subsection Table Creation
 @table @asis
 @item @emph{Prototype:}
 @code{int cass_table_create (cass_table_t **table,
                                 cass_env_t *env,
                                 char *table_name,
                                 uint32_t opr_id,
                                 int32_t cfg_id,
                                 int32_t parent_id,
                                 int32_t parent_cfg_id,
                                 int32_t map_id,
                                 char *param);}
 @item @emph{Description:}
 This routine creates a new table.  Note that this routine does not associate the
 table with its parent.  To do this, use
 @ref{x-tableassociate,,@code{cass_table_associate}}.
 @table @code
 @item table
 The pointer to the newly created table is returned through this pointer.
 @item env
 The environment in which the table is created.
 @item table_name
 Name of the table.
 @item opr_id
 The ID of the table class.  @xref{x-index,,Index and Sketch Reference}, for
 possible values.
 @item cfg_id
 The ID of the table configuration.  For tables that do not need a
 configuration, such as indexes, @code{cfg_id} can be -1.
 @item parent_id
 The ID of the parent table.  For tables without a parent, use -1.
 @item parent_cfg_id
 The ID of the parent table's configuration.  Can be -1.
 @item map_id
 The ID of the map object to use.
 @item param
 Extra parameters to pass to the underlying implementation.
 @xref{x-index,,Index and Sketch Reference}, for details.
 @end table
 @end table

 @subsection Free a Table
 @table @asis
 @item @emph{Prototype:}
 @code{int cass_table_free(cass_table_t *table);}
 @item @emph{Description:}
 This routine frees the table from the main memory.  The on-disk table is not
 destroyed.
 @end table

 @subsection Describe a Table
 @table @asis
 @item @emph{Prototype:}
 @code{int cass_table_describe (cass_table_t *, CASS_FILE *);}
 @item @emph{Description:}
 This routine writes a textual description of the table to the provided stream.
 @end table

 @subsection Import and Export Data
 @table @asis
 @item @emph{Prototype:}
 @code{int cass_table_import_data(cass_table_t *table, char *fname);}

 @code{int cass_table_export_data(cass_table_t *table, char *fname);}

 @item @emph{Description:}
 These two routines import a table from, or export a table to, an external text
 file.  @xref{x-fileformat,,Data File Format}, for details of the text file
 format.
 @end table

 @subsection Table Association
 @anchor{x-tableassociate}
 @table @asis
 @item @emph{Prototype:}
 @code{int cass_table_associate (cass_table_t *table, int32_t child);}

 @code{int cass_table_disassociate (cass_table_t *table, int32_t child);}

 @code{cass_table_t *cass_table_parent (cass_table_t *table);}
 @item @emph{Description:}
 @code{cass_table_associate} associates the table whose ID is @code{child} with
 @code{table}, so that it becomes a child of @code{table}.  When data are
 inserted into a table, the same data are automatically inserted into its
 children.

 @code{cass_table_disassociate} revokes the relationship.

 @code{cass_table_parent} returns the parent of the table, or @code{NULL} when
 the table has no parent.

 @end table


 @subsection Load and Release Feature Data
 @table @asis
 @item @emph{Prototype:}

 @code{int cass_table_load (cass_table_t *table);}

 @code{int cass_table_release (cass_table_t *table);}

 @item @emph{Description:}
 When the database is opened, the metadata of all the tables, as well as other
 objects, is loaded into memory.  However, the actual feature data (or
 indexes/sketches) are not.  In this way, the system knows the structural
 information of the database without too much memory overhead.  The feature data
 are only loaded when necessary and are released when not needed.  If the feature
 data are modified, they are automatically dumped to disk before being released.
 You can refer to the @code{loaded} field of @code{cass_table_t} to determine
 whether the feature data are loaded.
 The feature data must be loaded before data insertions or queries.

 @end table

 @subsection Data Insertion

 @table @asis
 @item @emph{Prototype:}
 @code{int cass_table_batch_insert (cass_table_t *table, cass_dataset_t *dataset,
 cass_vecset_id_t start, cass_vecset_id_t end);}

 @item @emph{Description:}
 This routine inserts the vecsets in @code{dataset} indexed by the range [@code{start}, @code{end}]
 into @code{table}.

 @end table

 @node Queries
 @section Queries

 There are two types of queries in the Ferret toolkit: range queries and K-NN
 queries.  K-NN queries are well supported, and there are indexes/sketches for
 various distance measures.  For each query, the user specifies one query vecset
 and either the required range or the number of nearest neighbors, and the
 system makes a best effort to return the results.  Queries can be issued at two
 levels: the vecset level or the vector level.  At the vecset level, the vecsets
 in the table that are closest to the query vecset, or within the specified
 range, are returned.  At the vector level, a query is carried out for each
 vector in the query vecset, and the results are returned in multiple sets.
 Also, in vector level queries, the dataset is the union of all the vecsets in
 the table, and the vecset-vector relationship is disregarded.  That means more
 than one vector in a result set can belong to the same vecset.  Furthermore,
 the result sets can be returned either as arrays or as bitmaps.

 To issue a query, the user needs to prepare a query data structure that
 specifies the query information and provide a result structure to hold the
 return value.  The related structures are described below.

 @smallexample
 typedef struct @{
         cass_id_t               id;
         cass_dist_t             dist;
 @} cass_list_entry_t;

 typedef ARRAY_TYPE(cass_list_entry_t) cass_list_t;

 typedef struct @{
         uint32_t                flags;

         union @{
                 bitmap_t        bitmap;
                 cass_list_t     list;
                 ARRAY_TYPE(bitmap_t) bitmaps;
                 ARRAY_TYPE(cass_list_t) lists;
         @};
 @} cass_result_t;

 typedef struct @{
         uint32_t                flags;

         cass_dataset_t         *dataset;
         cass_id_t               vecset_id;

         cass_size_t             topk;
         cass_dist_t             range;

         char                   *extra_params;
         cass_result_t          *candidate;

         int32_t                 vec_dist_id;
         int32_t                 vecset_dist_id;
 @} cass_query_t;
 @end smallexample

 The @code{cass_result_t} structure is a union plus a @code{flags} field.  The
 @code{flags} field also appears in @code{cass_query_t} and tells how to
 interpret the union, along with other requirements.  The @code{flags} in
 @code{cass_query_t} specifies the user's requirements, and the @code{flags} in
 @code{cass_result_t} specifies the system's response.  The possible values of
 the two are essentially the same.  Specifically, the user can set one of the
 following four values in the @code{flags} of the @code{cass_query_t} structure
 to select which representation of the result in @code{cass_result_t} is
 required.

 @table @code
 @item CASS_RESULT_BITMAP
 The @code{bitmap} field of the union is used.  The query should be in vecset
 level.
 @item CASS_RESULT_LIST
 The @code{list} field of the union is used.  The query should also be in vecset
 level.
 @item CASS_RESULT_BITMAPS
 The @code{bitmaps} field of the union is used.  The query should be in vector
 level.
 @item CASS_RESULT_LISTS
 The @code{lists} field of the union is used. The query should also be in vector
 level.
 @end table
 Note that for vector level queries, even if there is always only one vector in
 the vecset (the vecset is of the type @code{CASS_VECSET_SINGLE}), either
 @code{bitmaps} or @code{lists} should be used.

 The @code{flags} can also have one or more of the following ORed to its value.

 @table @code
 @item CASS_RESULT_MALLOC
 The memory to hold the actual data in @code{list}/@code{lists} has not been
 allocated by the user and should be allocated by the system.
 @item CASS_RESULT_USERMEM
 The user has allocated memory in @code{list}/@code{lists} and the system should
 use the user provided memory.  This is useful when several queries are issued
 in a row, allowing the user to allocate memory once and reuse it for the
 following queries.
 @item CASS_RESULT_REALLOC
 The user should not set this flag, but if it appears in the return value, it
 means that the user allocated memory is not enough and the system has
 reallocated the memory.
 @item CASS_RESULT_SORT
 The user requests that the result list be sorted.
 @item CASS_RESULT_DIST
 The user requires that the @code{dist} field of @code{cass_list_entry_t} be filled if
 @code{list} or @code{lists} is used.  This value means the distance of the
 corresponding vector/vecset to the query vector/vecset.
 @end table

 It is important to note that all the above flags are treated as suggestions
 rather than mandatory requirements.  The user should always check the
 @code{flags} of @code{cass_result_t} and interpret the result accordingly.  For
 example, some of the index/sketch algorithms do not refer to the original
 feature data and have no way to compute the distance between a data
 vector/vecset and the query vector/vecset, so even if the user requests
 @code{CASS_RESULT_DIST}, it will not be returned.  Some of the index/sketch
 algorithms keep the K-NNs in a heap, and sorting the heap with heap sort is
 cheaper than running a regular quicksort afterwards.  In that case, if the user
 requests @code{CASS_RESULT_SORT}, the results will be sorted.  For other
 algorithms, if sorting within the query procedure is no faster than sorting
 outside it by the user, the algorithm will disregard the sorting request.
 Finally, some of the index methods, like LSH, will always return bitmaps
 instead of lists.
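
 Checking the returned flags against the requested ones can be sketched as
 follows.  This is an illustration only and makes two loud assumptions: the
 @code{DEMO_RESULT_*} values are hypothetical (the real @code{CASS_RESULT_*}
 values come from @file{cass.h}), and the system is assumed to leave
 @code{CASS_RESULT_SORT} and @code{CASS_RESULT_DIST} set in the result only when
 it actually honored them.  The type @code{demo_followup_t} and the function
 @code{demo_check_flags} are likewise not part of the Ferret Library.

 @smallexample
 #include <stdint.h>

 /* Hypothetical flag values; the real ones are defined in cass.h. */
 enum @{
         DEMO_RESULT_LIST    = 1 << 0,
         DEMO_RESULT_BITMAP  = 1 << 1,
         DEMO_RESULT_LISTS   = 1 << 2,
         DEMO_RESULT_BITMAPS = 1 << 3,
         DEMO_RESULT_SORT    = 1 << 4,
         DEMO_RESULT_DIST    = 1 << 5
 @};

 /* What the caller still has to do after the query returns. */
 typedef struct @{
         int is_list;    /* result came back as list or lists */
         int must_sort;  /* sorting was requested but not delivered */
         int must_dist;  /* distances were requested but not delivered */
 @} demo_followup_t;

 static demo_followup_t demo_check_flags (uint32_t requested, uint32_t returned)
 @{
         demo_followup_t f;
         f.is_list = (returned & (DEMO_RESULT_LIST | DEMO_RESULT_LISTS)) != 0;
         /* Sorting and distances only apply to list results. */
         f.must_sort = f.is_list && (requested & DEMO_RESULT_SORT)
                         && !(returned & DEMO_RESULT_SORT);
         f.must_dist = f.is_list && (requested & DEMO_RESULT_DIST)
                         && !(returned & DEMO_RESULT_DIST);
         return f;
 @}
 @end smallexample

 With this helper, a caller that asked for a sorted list with distances can tell
 whether it received a bitmap instead, or whether it has to sort the list or
 compute the distances itself.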

 The other fields of the @code{cass_query_t} are explained below.

 @table @code
 @item dataset
 @item vecset_id
 The vecset in @code{dataset} specified by @code{vecset_id} is used as the query
 vecset.
 @item topk
 @item range
 If @code{topk > 0}, the system performs a K-NN query; otherwise, the system
 performs a range query with the range specified by @code{range}.
 @item extra_params
 The extra parameters, as a string, passed to the query algorithm.  The string is
 interpreted differently by different query index/sketch algorithms.
 @item candidate
 If @code{candidate != NULL}, then the query is carried out within the data
 provided by @code{candidate} instead of the whole dataset.  This often happens
 when the user wants to refine the result of a fast but inaccurate index
 algorithm with a more accurate one.  In that case, @code{candidate} can directly
 point to the previous query result.  In the case of a vector level query, the
 order of lists/bitmaps in @code{candidate} should be the same as the order of
 vectors in the query vecset.

 @item vec_dist_id
 @item vecset_dist_id
 The vector distance and vecset distance to be used.  If it is a vector level
 query, @code{vecset_dist_id} is disregarded.

 @end table

 The following routines are used to issue queries to tables.

 @table @asis
 @item @emph{Prototype:}
 @code{int cass_table_query (cass_table_t *table, cass_query_t *query,
 cass_result_t *result);}

 @code{int cass_table_batch_query (cass_table_t *table, uint32_t count,
 cass_query_t **queries, cass_result_t **results);}

 @item @emph{Description:}
 A batch query has the same effect as issuing the single queries one by one.
 For some index/sketch algorithms, issuing multiple queries in a batch gives
 higher performance, but this is not always the case.
 @end table

 @node Multiple Modality Support
 @section Multiple Modality Support

 @node Vector Distance Reference
 @section Vector Distance Reference
 @include vecdist.texi

 @node Vecset Distance Reference
 @section Vecset Distance Reference
 @include vecsetdist.texi

 @node Index and Sketch Reference
 @section Index and Sketch Reference
 @include index.texi