% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/ml_clustering_bisecting_kmeans.R
\name{ml_bisecting_kmeans}
\alias{ml_bisecting_kmeans}
\title{Spark ML -- Bisecting K-Means Clustering}
\usage{
ml_bisecting_kmeans(x, formula = NULL, k = 4, max_iter = 20,
  seed = NULL, min_divisible_cluster_size = 1,
  features_col = "features", prediction_col = "prediction",
  uid = random_string("bisecting_bisecting_kmeans_"), ...)
}
\arguments{
\item{x}{A \code{spark_connection}, \code{ml_pipeline}, or a \code{tbl_spark}.}

\item{formula}{Used when \code{x} is a \code{tbl_spark}. R formula as a character string or a formula. This is used to transform the input dataframe before fitting, see \link{ft_r_formula} for details.}

\item{k}{The number of clusters to create.}

\item{max_iter}{The maximum number of iterations to use.}

\item{seed}{A random seed. Set this value if you need your results to be
reproducible across repeated calls.}

\item{min_divisible_cluster_size}{The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) of a divisible cluster (default: 1.0).}

\item{features_col}{Features column name, as a length-one character vector. The column should be a single vector column of numeric values. Usually this column is output by \code{\link{ft_r_formula}}.}

\item{prediction_col}{Prediction column name.}

\item{uid}{A character string used to uniquely identify the ML estimator.}

\item{...}{Optional arguments; currently unused.}
}
\value{
The object returned depends on the class of \code{x}.

\itemize{
  \item \code{spark_connection}: When \code{x} is a \code{spark_connection}, the function returns an instance of a \code{ml_estimator} object. The object contains a pointer to
  a Spark \code{Estimator} object and can be used to compose
  \code{Pipeline} objects.

  \item \code{ml_pipeline}: When \code{x} is a \code{ml_pipeline}, the function returns a \code{ml_pipeline} with
  the clustering estimator appended to the pipeline.

  \item \code{tbl_spark}: When \code{x} is a \code{tbl_spark}, an estimator is constructed then
  immediately fit with the input \code{tbl_spark}, returning a clustering model.

  \item \code{tbl_spark}, with \code{formula} or \code{features} specified: When \code{formula}
    is specified, the input \code{tbl_spark} is first transformed using a
    \code{RFormula} transformer before being fit by
    the estimator. The object returned in this case is a \code{ml_model} which is a
    wrapper of a \code{ml_pipeline_model}. This signature does not apply to \code{ml_lda()}.
}
}
\description{
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. It then iteratively finds the divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
}
\examples{
\dontrun{
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

iris_tbl \%>\%
  ml_bisecting_kmeans(formula = Species ~ ., k = 4)
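
# A further sketch, assuming the iris_tbl created above: fit with an
# explicit seed and a proportional min_divisible_cluster_size, then
# assign clusters with ml_predict(). The values 3, 42 and 0.1 are
# illustrative choices only, not recommendations.
model <- ml_bisecting_kmeans(
  iris_tbl,
  formula = Species ~ .,
  k = 3,
  seed = 42,
  min_divisible_cluster_size = 0.1  # below 1, so read as a proportion of points
)

# Count how many rows are assigned to each cluster.
ml_predict(model, iris_tbl) \%>\%
  count(prediction)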
}
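
\dontrun{
# A hedged sketch of the other two dispatch modes described in the Value
# section. The local master and k = 3 are illustrative assumptions.
library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# spark_connection: returns an unfitted estimator.
estimator <- ml_bisecting_kmeans(sc, k = 3)

# ml_pipeline: appends the clustering estimator to a pipeline. The
# ft_r_formula() stage builds the "features" column the estimator expects.
pipeline <- ml_pipeline(sc) \%>\%
  ft_r_formula(Species ~ .) \%>\%
  ml_bisecting_kmeans(k = 3)

# Fitting the pipeline yields a pipeline model that can transform data.
fitted <- ml_fit(pipeline, iris_tbl)
ml_transform(fitted, iris_tbl)

spark_disconnect(sc)
}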

}
\seealso{
See \url{https://spark.apache.org/docs/latest/ml-clustering.html} for
  more information on the set of clustering algorithms.

Other ml clustering algorithms: \code{\link{ml_gaussian_mixture}},
  \code{\link{ml_kmeans}}, \code{\link{ml_lda}}
}
\concept{ml clustering algorithms}
