%====================================================================
\section{Introduction}
\label{sec:intro}
A common approach to discovery in biology is to construct experiments or analyses that directly contrast two specific classes of biological objects. Examples of this include examining patient samples contrasting tumor versus normal tissue \needcite{} \agp{we should start filling this in.}, studying the differences in molecular effects of two competing drug treatments \needcite{}, or characterizing differentially expressed genes compared to genes with unaltered gene expression in a carefully designed experiment \needcite{}. As is often the case with genomics, these biological objects (e.g., tissue samples or drug experiments) are frequently represented by high-dimensional numeric feature vectors (e.g., tens of thousands of transcript abundance measurements). To understand the mechanisms that lead to or explain these two object classes, researchers often employ statistical and machine learning tools to identify a manageable subset of features that accentuate or explain the differences between them. We term this the {\em separability problem}. Often, researchers solve this problem using methods that rank each singular feature via statistical tests \needcite{} and univariate classifiers \needcite{}. These approaches are fast and scale linearly as the number of features increases, but do not offer insight into the interplay between important features. Alternatively, larger subsets of separating features can be identified by more complex machine learning approaches, such as multivariate regression with LASSO regularization \needcite{} or pattern mining from random forest models \needcite{Han paper}. However, running these machine learning methods on datasets with tens of thousands of features is extremely time-intensive. Motivated by these observations, we sought to build a tool that strikes a balance between the two above-mentioned strategies, enabling us to retrieve more than just the best {\em singular} features, and to do so in an {\em interactive} manner.
In this paper, we present \genviz, an interactive data exploration tool that, given a matrix of values for object features and two labeled object sets, rapidly identifies the top ranking pairs of features that most clearly separate the objects of the different classes. In this work, we find high ranking feature pairs that more significantly discriminate between the object classes than the corresponding best ranking individual features. By focusing on separating feature pairs, \genviz offers the researcher the ability to gain additional insight into the biological systems beyond singular features without the prolonged waiting time needed to train a complex machine learning model. Moreover, feature pairs can be effortlessly visualized in a two-dimensional interface. However, quickly finding top ranking feature pairs in typical biological datasets with many features is a very challenging task due to several fundamental limitations:
first, the number of feature pairs can be very large;
%we cannot avoid looking at all feature pairs since there may be surprising correlations that go beyond single features;
second, the feature-object matrix must be examined multiple times to evaluate the separability of all feature pairs;
third, the matrix precomputations that can reduce latency are incompatible with supporting on-the-fly submission of different possible object sets.
%third, we cannot assume anything about the matrix, e.g., whether it is low or high rank, since the two classes of objects may be provided on-the-fly.
For example, if objects were experimental conditions and features were gene expression measurements, having about $20K$ gene features means that we need to evaluate nearly 200 million possible feature pairs; with a naive approach, each feature pair additionally requires a scan over all objects, say 100K of them. Therefore, sophisticated methods are required to make the task of ranking feature pairs practical and fast.
\begin{figure*}[t]
\centering
\includegraphics[width=0.9\linewidth]{fig/workflow2.pdf}
\caption{Overall \genviz Workflow. Given (a) a feature-object matrix and positive (green) and negative (red) class labels on the objects, \genviz (b) searches and evaluates all pairs of features using several optimizations to find (c) the top feature pair and its corresponding visualization that best separate the object classes.}
\label{fig:workflow}
\end{figure*}
In developing \genviz, we identified a measure to score the separability of the two classes of objects based on a given feature pair that is conducive to {\em (a)} optimization techniques for enabling fast computation of feature pair rankings and {\em (b)} visualization methods that allow researchers to appreciate the interplay between the values of the features and the class of the objects. Satisfying both of these requirements enables \genviz to offer separability analysis in a responsive exploration environment that promotes the generation and prioritization of hypotheses for further investigation. Specifically, we develop various novel optimizations to make \genviz more interactive, targeting: {\em (a)} how to eliminate repeated computations; {\em (b)} how to prune feature pairs during early execution; {\em (c)} how to use a sampling technique to further reduce the running time while retaining a statistical guarantee; {\em (d)} how to traverse the search space of all feature pairs for improved efficiency. We are not aware of any other work that proposes this suite of optimizations specifically for the separability problem.
\agp{we aren't selling our optimizations strongly enough. I have no idea of what the tool actually does to reduce time.}
\sinha{you're right. But is might be ok to be silent on time here}
\agp{why are these novel?}
\sinha{silu, check related work}
\silu{please have a check.}
We applied our \genviz tool to two genomic datasets. In one, called \lincs, we find pairs of genes whose expression discriminates between perturbagen experiments of different drugs, and in the other, called \msig, we find pairs of annotations (such as pathway membership) that separate differentially expressed cancer genes from all others. Both of these applications are on objects with high-dimensional feature vectors, so it is computationally expensive to score the separability for all possible feature pairs. With the \genviz tool, its well-designed linear separability metric, and its suite of sophisticated optimizations that alter how and which feature pairs are evaluated, we are able to accurately return the highest ranking separating feature pairs for both datasets within two minutes. This reflects a {\bf 180X} and {\bf 400X} speedup over a competitive baseline for the \msig and \lincs data sets respectively. With this speedup, \genviz enables researchers to quickly explore their data, identifying the strongest, most compelling features and forming hypotheses about the interplay between them and with the object classes. It also allows researchers to investigate different definitions of the object classes on the fly, explore alternative hypotheses, and set up more in-depth, longer machine learning based analyses that build upon the \genviz results. We performed an in-depth analysis for nine drugs in the \lincs dataset and found 1070 feature pairs that more significantly separated the perturbagen experiments than their corresponding single features and were enriched in literature support for known relationships between the genes and the drug, as well as between the pair of genes themselves. \genviz for separating gene sets is available as a web-based tool at knowcluster03.knoweng.org:9581 \cb{need to change} or as general-purpose source code at https://github.com/KnowEnG/Genvisage.
\silu{todo: add a screenshot}
\xagp{one line on key scientific insight would be good.}
\xagp{not clear (1) why is the problem hard? (2) what are prior approaches and why do they not apply? (3)what are the key ingredients of your ?}
%==================================================================
\section{Methods}
\label{sec:method}
We begin by formally defining the {\em separability} problem, which, roughly speaking, is to find the best way to separate two sets of biological objects (such as genes or experiments) represented by high-dimensional vectors. We introduce the separability metric employed by \genviz, and finally detail several optimizations that enable the rapid identification of the best separating feature pairs.
%==================================================================
\subsection{Problem Definition}\label{sec:prob}
Let $\mm$ be a feature-object matrix of size $m\times N$, where each row is a feature and each column is an object as shown in Figure~\ref{fig:workflow}(a). An example feature-object matrix is one where each object corresponds to a tissue sample from a cancer patient and each feature corresponds to a gene, such that the $(i,j)^{th}$ entry represents the expression level of the $i^{th}$ gene in the $j^{th}$ tissue sample. We denote the $m$ features as $\ff=\{f_1,f_2,\cdots,f_m\}$ and $N$ objects as $\oo=\{o_1,o_2,\cdots,o_N\}$. Each entry $\mm_{i,j}$ in $\mm$ is a number, representing the value of feature $f_i$ for object $o_j$ as illustrated in Figure~\ref{fig:workflow}(a).
In addition, we are given two non-overlapping sets of objects, one with a positive label and the other with a negative label, denoted as $\oo_+$ and $\oo_-$ respectively. Both sets are disjoint subsets of all objects, i.e., $\oo_+\subset \oo$, $\oo_-\subset \oo$, and $\oo_+\cap \oo_- = \emptyset$. In our example, we may have the tumor samples, $\oo_+$, be assigned the positive label, and the healthy tissue samples, $\oo_-$, be assigned the negative label. Let $\widehat{\oo}$ be the union of positive and negative objects and $n$ be the total number of labeled objects, i.e., $\widehat{\oo}=\oo_+\cup \oo_-$, $n=|\widehat{\oo}|$, where $n\leq N$. Also, let $l_k$ be the label of object $o_k\in \widehat{\oo}$, i.e., $l_k=1$ if $o_k$ is positive and $l_k=-1$ if $o_k$ is negative.
The goal of \genviz is as follows: given a matrix $\mm$ and two object sets $\oo_+$ and $\oo_-$ as depicted in Figure~\ref{fig:workflow}(a), \genviz aims to find one or more feature pairs that offer the best separation of objects in $\oo_+$ from those in $\oo_-$ represented by only those two features, and output a visualization that demonstrates the separability as shown in Figure~\ref{fig:workflow}(c). (We will define the metric for separability subsequently.) As we described in the introduction, a feature pair that leads to a good separation between the positive and the negative sets may be able to explain the differences between the two sets through an interesting, complex relationship between the features. The overall workflow is depicted in Figure~\ref{fig:workflow}. Since we intend \genviz to be an interactive data exploration tool that is employed before more time-consuming machine learning methods, we allow for the algorithm to make minor sacrifices to its accuracy in exchange for substantial improvements in its running time. We formally define the separability problem as follows.
\begin{formulation}[Separability]\label{prob:separability}
Given a feature-object matrix $\mm$ and two labeled object sets $(\oo_+,\oo_-)$, rapidly identify the \topk feature pairs $(f_i,f_j)$ separating $\oo_+$ from $\oo_-$ based on a given separability metric, and produce their corresponding {visualizations}.
\end{formulation}
Here a separability metric is a scoring function for any given feature pair, and \topk feature pairs refer to the $k$ feature pairs with the highest score under the separability metric. We will next describe our selected separability metric in Section~\ref{sec:metric}, and then discuss optimization techniques in Section~\ref{sec:opt}. We track our notation in Table~\ref{tbl:notation}.
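To make the search space of this formulation concrete, the exhaustive approach can be phrased as a single function over the inputs of Figure~\ref{fig:workflow}(a). The Python sketch below is purely illustrative (the function and argument names are hypothetical and it is not the \genviz implementation); it makes explicit the naive enumeration over all feature pairs that the optimizations of Section~\ref{sec:opt} are designed to avoid.
\begin{verbatim}
import numpy as np

def top_k_separating_pairs(M, pos, neg, k, metric):
    """Rank all feature pairs (f_i, f_j) by the given separability
    metric and return the k highest-scoring pairs.
    M: m x N feature-object matrix; pos / neg: index arrays of the
    positively / negatively labeled objects; metric: scoring callable."""
    labeled = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    m = M.shape[0]
    scores = {}
    for i in range(m):
        for j in range(i, m):   # i == j corresponds to a single feature
            scores[(i, j)] = metric(M[i, labeled], M[j, labeled], labels)
    return sorted(scores, key=scores.get, reverse=True)[:k]
\end{verbatim}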
\begin{table}[t!]
\centering
\small
\begin{tabular}{|c|c|c|c|}
\hline
Symb. & Description & Symb. & Description\\
\hline
\hline
$\mm$ & feature-object matrix & $\ff$ & feature set in $\mm$ \\
\hline
$f_i$ & feature $i$ in $\ff$ & $m$ & number of features in $\ff$\\
\hline
$\oo$ & object set in $\mm$ & $N$ & number of objects in $\oo$\\
\hline
$\oo_+$ & positive object set & $\oo_-$ & negative object set\\
\hline
$\widehat{\oo}$ & labeled object set & $n$ & number of labeled objects in $\widehat{\oo}$\\
\hline
$o_k$ & object $k$ in $\widehat{\oo}$ & $l_k$ & label of object $o_k$\\
\hline
$\ell$ & separating line in 2-D & $\lhat$ & representative line in 2-D\\
\hline
$\eta_{i,j}^{\ell,k}$ & predicted label of $o_k$ & $\theta_{i,j}^{\ell,k}$ & $o_k$ is correctly separated? \\
\hline
$\theta_{i,j}^{\ell}$ & \# correctly separated $o_k$ & $\theta_{i,j}$ & separability score\\
\hline
$\widehat{\mm}$ & $\mm$ after transformation & $\tilde{\theta}_{i,j}$ & estimated $\theta_{i,j}$\\
\hline
\end{tabular}
\caption{\genviz Method Notation}
\label{tbl:notation}
\vspace{-18pt}
\end{table}
%==================================================================
\subsection{Separability Metric}\label{sec:metric}
Given a feature pair $(f_i,f_j)$ as axes, we can visualize the object sets $\oo_+$ and $\oo_-$ in a 2-D space, where each object corresponds to a point with x-value and y-value as the object's value on feature $f_i$ and $f_j$ respectively. A desirable (i.e., both interesting and interpretable) visualization would be one in which the objects are {\em linearly separated}, defined as follows. Two sets of objects, i.e., $\oo_+$ and $\oo_-$, are said to be \emph{linearly separable} if there exists at least one straight line in the 2-D space such that all points from $\oo_+$ are on one side of the line while all points from $\oo_-$ are on the other side. In this work, we focus on a {\bf linear separability metric} because it corresponds to an intuitive 2-D visualization.
Given a feature pair $(f_i,f_j)$, we can represent a line $\ell$ using Equation~\ref{eqn:line} below, where $x$ and $y$ represent an object's value on feature $f_i$ and $f_j$ respectively, and $w_0$, $w_i$ and $w_j$ are coefficients, with $w_j>0$.
\begin{equation}\label{eqn:line}
\begin{split}
\ell: \hspace{4mm} w_i\cdot x + w_j\cdot y +w_0 =0
\end{split}
\end{equation}
Given a feature pair $(f_i,f_j)$ and a line $\ell$, we can predict the label of an object $o_k$ based on the separating line in Equation~\ref{eqn:line}. Let $\eta_{i,j}^{\ell,k}$ be the predicted label of $o_k$, calculated using Equation~\ref{eqn:est_label} below:
\begin{equation}\label{eqn:est_label}
{\sf Predicted \hspace{1mm} Label:}\hspace{3mm} \eta_{i,j}^{\ell,k}=sign(w_i\cdot \mm_{i,k} + w_j\cdot \mm_{j,k} +w_0)
\end{equation}
If $o_k$ lies above the line $\ell$, \silu{i.e., $o_k$ has higher value on y-axis than the point on line $\ell$ with the same value on x-axis as $o_k$,} then $\eta_{i,j}^{\ell,k}=1$; otherwise, $\eta_{i,j}^{\ell,k}=-1$. Let $\theta_{i,j}^{\ell,k}$ be the indicator variable denoting whether an object $o_k$ is correctly separated (i.e., the sign of the predicted label is the same as the real label $l_k$):
\begin{equation}\label{eqn:s_object}
\theta_{i,j}^{\ell,k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} \eta_{i,j}^{\ell,k}\cdot l_k = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
If there exists a line $\ell$ such that for every object $o_k\in \widehat{\oo}$, the object is correctly separated (i.e., $\theta_{i,j}^{\ell,k}$ = 1), then we say $\oo_+$ and $\oo_-$ are linearly separable.
% \begin{equation}\label{eqn:linear}
% \eta_{i,j}^{\ell,k} = \left\{
% \begin{array}{ll}
% 1 \textit{\hspace{2mm} if \hspace{2mm}} o_k\in \oo_+, \textit{ i.e., } l_k=1 \\
% -1 \textit{\hspace{2mm} if \hspace{2mm}} o_k\in \oo_-, \textit{ i.e., } l_k=-1
% \end{array}
% \right.
% \end{equation}
\vspace{2mm}
For \genviz, we rank feature pairs by a separability metric based on {\em how well the objects in the feature pair's 2-D visualization can be linearly separated}. We now formally define this separability metric.
%The basic intuition is that the users can easily recognize and interpret linear separability in a 2-D visualization. Thus if a visualization is perfectly linearly separated, it should have high score in our proposed separability metric; otherwise, we would find the best separating line $\ell$ in the visualization and report the largest number of correctly separated objects as the separability score.
Given a feature pair $(f_i, f_j)$ and a line $\ell$, the separability score of the line (denoted $\theta_{i, j}^\ell$) is defined as the sum of the indicators ($\theta_{i,j}^{\ell,k}$) for all objects, as shown in Equation~\ref{eqn:s_line}:
\begin{equation}\label{eqn:s_line}
\hspace{3mm} \theta_{i,j}^{\ell}= \sum_{k}{\theta_{i,j}^{\ell,k}}
\end{equation}
\begin{figure}[h]
\centering %%% not \center
%\vspace{-5mm}
\subfigure[Brute Force]{\label{fig:brute_force}\includegraphics[width=.235\textwidth]{fig/metric.pdf}}
\subfigure[Rocchio-Based]{\label{fig:rocchio}\includegraphics[width=.235\textwidth]{fig/rocchio.pdf}}
\vspace{-5mm}
\caption{Calculating Separability Score $\theta_{i,j}$. The separating line that is scored can be found by (a) brute force, or taken to be (b) the representative line of the Rocchio-based measure, derived from the object class centroids (white circles).}
\vspace{-5mm}
\label{fig:metric}
\end{figure}
Figure~\ref{fig:brute_force} shows separability scores $\theta_{i, j}^\ell$ for different separating lines. For example, the separating line with $\theta_{i, j}^\ell=12$ correctly separates six green points and six red points. The final separability score for a feature pair $(f_i,f_j)$ (denoted as $\theta_{i, j}$) is defined as the best separability score $\theta_{i, j}^{\ell}$ among all possible lines $\ell$. Accordingly, we define the overall separability error of the feature pair as $err_{i,j}=n-\theta_{i, j}$.
%As shown in Equation~\ref{eqn:s_viz},
%\begin{equation}\label{eqn:s_viz}
%\hspace{3mm} \theta_{i,j}= \max_{\ell}\{\theta_{i,j}^{\ell}\} %{\sf Separability \hspace{1mm} %Score:}
%\end{equation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%NEED TOGGLE%%%%%%%%%%%%%%%%%%%%%%
% \begin{figure}[h]
% \centering
% \begin{subfigure}{0.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/metrics.pdf}
% \vspace{-5mm}
% \caption{Brute Force}%{$\theta_{i,j}^{\ell}$ with Different Lines[e]}
% \label{fig:brute_force}
% \end{subfigure}
% \begin{subfigure}{.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/rocchio.pdf}
% \vspace{-5mm}
% \caption{Rocchio-Based}%{$\theta_{i,j}^{\lhat}$ with Representative Line[f]}
% \label{fig:rocchio}
% \end{subfigure}
% \caption{Different methods to Calculate Separability Score $\theta_{i,j}$}
% \label{fig:metric}
% \end{figure}
\stitle{Brute Force Calculation of $\theta_{i,j}$.} As suggested in Figure~\ref{fig:brute_force}, the simplest way to calculate $\theta_{i,j}$ is to first enumerate all possible separating lines $\ell$ and calculate $\theta_{i,j}^\ell$ for each of them. This is infeasible as there are an infinite number of possible lines. However, we can easily trim down the search space to $O(n^2)$ lines by linking the points corresponding to every two objects in the 2-D plane. This is because the results of all other possible lines can be covered by these $O(n^2)$ lines~\cite{vapnik1998statistical}. Nevertheless, it is still very time-consuming to consider $O(n^2)$ lines for each feature pair $(f_i,f_j)$.
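As a concrete (and deliberately naive) illustration of this brute-force procedure, the sketch below scores every candidate line defined by a pair of labeled points. It is illustrative Python with hypothetical names, not part of \genviz, and a real implementation would nudge each candidate line slightly off its two defining points.
\begin{verbatim}
import numpy as np
from itertools import combinations

def brute_force_theta(x, y, labels):
    """Best number of correctly separated objects over the O(n^2)
    candidate lines through pairs of labeled points; x, y are the
    objects' values on f_i and f_j, labels is a +1/-1 vector."""
    best = 0
    for a, b in combinations(range(len(labels)), 2):
        # line through points a and b: w_i*x + w_j*y + w_0 = 0
        w_i, w_j = -(y[b] - y[a]), x[b] - x[a]
        w_0 = -(w_i * x[a] + w_j * y[a])
        pred = np.sign(w_i * x + w_j * y + w_0)
        # score both orientations of the line
        best = max(best, int(np.sum(pred == labels)),
                   int(np.sum(-pred == labels)))
    return best
\end{verbatim}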
\stitle{Rocchio-based Measure.} Rather than enumerating through the $O(n^2)$ possible lines to find the one with maximum $\theta_{i,j}^{\ell}$, we propose to speed up the process by intelligently selecting a single {\em representative line} $\lhat$ that will provide us with an estimated separability score, $\theta_{i,j}^{\lhat}$ that approximates the true linear separability score $\theta_{i, j}$.
%as shown in Equation~\ref{eqn:s_viz_appr}, .
%\begin{equation}\label{eqn:s_viz_appr}
%\theta_{i,j} \leftarrow \theta_{i,j}^{\lhat} %\approx
%\end{equation}
In order to achieve a fast and reliable estimate, we select our representative line based on Rocchio's algorithm~\cite{rocchio1971relevance}.
%As we will show later in Section~\ref{sec:exp}, $\theta_{i,j}^{\lhat}$ is comparable to $\theta_{i,j}$ when using Rocchio-based representative line $\lhat$. Let us describe in detail about the Rocchio-based representative line.
%The intuition behind using Rocchio's algorithm is that each object's predicted label should be the class label of its nearest class centroid.
More specifically, let us denote the centroids of positive objects $\oo_+$ and negative objects $\oo_-$ for a given $(f_i,f_j)$ as $\mu_{i,j}^+=(\mm_i^+,\mm_j^+)$ and $\mu_{i,j}^-=(\mm_i^-,\mm_j^-)$ respectively, where $\mm_i^+$ and $\mm_j^+$ are the values of the centroids of the positive objects on feature $f_i$ and $f_j$, and $\mm_i^-$ and $\mm_j^-$ are the values of the centroids of the negative objects on feature $f_i$ and $f_j$ respectively. The perpendicular bisector of the two centroids is selected as the representative separating line $\lhat$. Mathematically, $\lhat$ can be represented using Equation~\ref{eqn:line} with $w_i$, $w_j$ and $w_0$ instantiated as shown in Equation~\ref{eqn:rep_line}:
%\lhat: \hspace{4mm} (\mm_i^+-\mm_i^-)\cdot x + (\mm_j^+-\mm_j^-)\cdot y -(\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2}+\frac{(\mm_j^+)^2-(\mm_j^-)^2}{2}) =0
\begin{equation}\label{eqn:rep_line}
\begin{split}
& w_i = \mm_i^+-\mm_i^- \\
& w_j = \mm_j^+-\mm_j^- \\
& w_0 = -(\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2}+\frac{(\mm_j^+)^2-(\mm_j^-)^2}{2})
\end{split}
\end{equation}
In Figure~\ref{fig:rocchio}, the Rocchio-based measure with the representative separating line gives $\theta_{i,j}^{\lhat}=13$, with one negative object (red point) mis-predicted as positive, while the true $\theta_{i,j}$ also equals $13$.
\stitle{Brute-force vs. Rocchio-based.} Compared to brute force, the Rocchio-based measure is much more light-weight in terms of running time, but at the cost of accuracy in calculating $\theta_{i,j}$. Intuitively, the representative line is a reasonable proxy to the best separating line since the Rocchio-based measure computes the centroids of the two classes as their representatives and assigns each object to its nearest centroid. We will further empirically demonstrate that $\theta_{i,j}^{\lhat}$ is a good proxy for $\theta_{i,j}$ in Section~\ref{sec:exp_comp}. Thus, we will focus on the Rocchio-based measure in the remainder of the manuscript, using $\theta_{i,j}$ and $\theta_{i,j}^{\lhat}$ interchangeably.
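For illustration, the Rocchio-based measure for a single feature pair can be sketched as follows (Python with hypothetical names; a sketch of Equations~\ref{eqn:est_label}--\ref{eqn:rep_line}, not the \genviz source).
\begin{verbatim}
import numpy as np

def rocchio_theta(x, y, labels):
    """Separability score for one feature pair using the representative
    line, i.e., the perpendicular bisector of the two class centroids.
    x, y: object values on f_i and f_j; labels: +1/-1 vector."""
    pos, neg = labels == 1, labels == -1
    mi_p, mj_p = x[pos].mean(), y[pos].mean()   # positive centroid
    mi_n, mj_n = x[neg].mean(), y[neg].mean()   # negative centroid
    w_i, w_j = mi_p - mi_n, mj_p - mj_n         # representative line
    w_0 = -((mi_p**2 - mi_n**2) / 2 + (mj_p**2 - mj_n**2) / 2)
    pred = np.sign(w_i * x + w_j * y + w_0)     # predicted labels
    return int(np.sum(pred == labels))          # correctly separated objects
\end{verbatim}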
\xagp{you have used Rocchio's alg/measure/representative line. Maybe avoid some of them for consistency.}
%==================================================================
\subsection{Proposed Suite of Optimizations}\label{sec:opt}
In this section, we first analyze the time complexity of identifying the \topk feature pairs using the Rocchio-based measure of separability induced by each feature pair, and then propose several optimization techniques to reduce the complexity.
\stitle{Time Complexity Analysis.} First, consider the time required to calculate the separability score of a given feature pair $(f_i, f_j)$. If we have already calculated the class centroids for each feature, the separating line $\lhat$ can be calculated in $O(1)$ time using Equation~\ref{eqn:rep_line}. We must then calculate the number of correctly separated objects $\theta_{i,j}^{\lhat}$ by evaluating all objects with respect to the line, i.e., $O(n)$ evaluations. Since there are $O(m^2)$ feature pair candidates, the total time complexity is $O(m^2n)$. This running time can be very large, especially since $m$ and $n$ are typically large. In the following, we propose several optimizations to reduce the overall running time.
\stitle{High Level Overview of Optimizations.} Our optimizations fall into two categories. First, we introduce optimizations that reduce the amount of time required for fully evaluating a given feature pair and calculating its Rocchio-based separability measure (Section~\ref{ssec:trans},~\ref{ssec:earlyT}). Second, we introduce optimizations that reduce the number of feature pairs that require full evaluation without dropping many feature pairs that are in the \topk (Section~\ref{ssec:sampling},~\ref{ssec:traversal}).
In the first category of optimizations, the first optimization module, \trans (Section~\ref{ssec:trans}), reduces redundant calculations across feature pairs by mapping the original feature-object matrix $\mm$ into a new space that enables faster evaluation of object labeling, and thus faster calculation of separability score. The second optimization module, \earlyT (Section~\ref{ssec:earlyT}), takes advantage of the fact that evaluation of a poorly separating feature pair can be terminated early without having to evaluate the separability of all $n$ objects.
In the second category of optimizations, the \sampling module (Section~\ref{ssec:sampling}) attempts to first identify likely \topk feature pair candidates by evaluating their separability on a sampled subset of all objects, and then conduct full evaluations only on these feature pair candidates. As a result, the \sampling module is able to reduce the number of feature pairs that require full evaluation on all objects. Finally, the \traversal module (Section~\ref{ssec:traversal}) aims to reduce the number of feature pairs checked by greedily choosing feature pairs based on the separability of the corresponding individual, single features. These optimization modules can be used on their own or combined with each other.
In Section~\ref{sec:exp}, we will show how these optimization modules greatly reduce the running time of finding the \topk separating feature pairs without significantly affecting the accuracy.
%==================================================================
\subsubsection{Pre-Transformation for Faster Feature Pair Evaluation} \label{ssec:trans}
Let us review the process of computing the separability score $\theta_{i,j}^{\lhat}$. Given a feature pair $(f_i,f_j)$ and the corresponding positive and negative centroids, {\em (i)} we first compute $w_0$, $w_i$ and $w_j$ for $\lhat$ based on Equation~\ref{eqn:rep_line}. Next, for each object $o_k$, {\em (ii)} we obtain the predicted label $\eta_{i,j}^{\lhat,k}$ according to Equation~\ref{eqn:est_label}. This step requires two multiplications and three additions. Finally, {\em (iii)} we calculate $\theta_{i,j}^{\lhat,k}$ and the separability score $\theta_{i,j}^{\lhat}$ based on Equation~\ref{eqn:s_object} and \ref{eqn:s_line} respectively. This whole process is repeated for every feature pair candidate. However, we observe that there exists massive redundancy across the processing of different feature pairs. For instance, when calculating the separability score for two different feature pairs $(f_i,f_j)$ and $(f_i,f_{j'})$ with a common $f_i$, $w_i$ is in fact shared, and calculation of $w_i\cdot \mm_{i,k}$ in Equation~\ref{eqn:est_label} is repeated for each object $o_k$. Given this, we propose to first transform $\mm_{i,k}$ into another space to reduce this computational redundancy.
\sinha{is there another work we can cite that has done this?}
\stitle{High Level Idea.} The high level idea for \trans is to first calculate some common computational components across different feature pairs, save them, and then utilize these components to compute the separability score for each feature pair. In this way, we can eliminate the repeated computations and reduce the running time dramatically. In the following, we identify the common computations across different feature pairs, transform $\mm_{i,k}$ into another space using those components, and update the separability score equation accordingly.
\stitle{Details.} For each feature $f_i$, we find the average values of the positive and negative objects for that feature, $\mm_i^+$ and $\mm_i^-$ respectively, and then we pre-transform $\mm_{i,k}$, i.e., the value of object $o_k$ on the feature $i$, to $\widehat{\mm}_{i,k}$ as illustrated in Equation~\ref{eqn:matrix_transform}.
\begin{equation}\label{eqn:matrix_transform}
\widehat{\mm}_{i,k} = \Big( (\mm_i^+-\mm_i^-)\cdot \mm_{i,k}-\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2} \Big) \cdot l_k
\end{equation}
The basic idea is to decompose Equation~\ref{eqn:est_label} into two components, with each one only related to a single, individual feature. This pre-transformation incorporates the class centroids into the matrix values, obviating their integration later for every feature pair that involves the given feature. Equation~\ref{eqn:matrix_transform} also multiplies in the class label of the object, $l_k$, rather than repeating this multiplication every time the object is evaluated. With this transformation of the feature-object matrix, the calculation to evaluate whether an object was correctly separated is simplified from Equation~\ref{eqn:s_object} into Equation~\ref{eqn:s_object_transform}:
%, and get rid of checking the real label $l_k$ in Equation~\ref{eqn:s_object}
\begin{equation}\label{eqn:s_object_transform}
\theta_{i,j}^{\lhat,k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k}) = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
%\begin{equation}\label{eqn:est_label_transform}
%\eta_{i,j}^{\lhat,k}=\sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k})
%\end{equation}
After pre-transformation, we are now ready to compute the separability score $\theta_{i,j}^{\lhat}$. Given a feature pair $(f_i,f_j)$, we can compute $\theta_{i,j}^{\lhat,k}$ for each object $o_k$ based on Equation~\ref{eqn:s_object_transform}. Note that this step only involves one addition and one comparison. Next, we can calculate the overall separability score $\theta_{i,j}^{\lhat} = \sum_{k}{\theta_{i,j}^{\lhat,k}}$, just as in Equation~\ref{eqn:s_line} for the untransformed case. In all, compared to evaluations without the pre-transformation, we not only eliminate the steps of computing $w_0$, $w_i$ and $w_j$ for every feature pair, but also reduce the cost of calculating $\eta_{i,j}^{\lhat,k}$ in Equation~\ref{eqn:est_label}. In the following sections, we will consider $\widehat{\mm}$ instead of $\mm$, and Equation~\ref{eqn:s_object_transform} instead of Equation~\ref{eqn:est_label} and~\ref{eqn:s_object}.
%Hence, by conducting transformation in Equation~\ref{eqn:matrix_transform} and~\ref{eqn:s_object_transform}, we have mapped our problem to a new space.
\begin{figure}[h]
\centering %%% not \center
%\vspace{-5mm}
\subfigure[Pre-Transformation]{\label{fig:transform}\includegraphics[width=.235\textwidth]{fig/transformation.pdf}}
\subfigure[Early Termination ($\tau = 1$)]{\label{fig:earlyT}\includegraphics[width=.235\textwidth]{fig/earlyT.pdf}}
\vspace{-5mm}
\caption{Optimizations for Feature Pair Evaluation. (a) The \trans module is applied once to the original values in the feature-object matrix $\mm$ (above) using Equation~\ref{eqn:matrix_transform} to produce $\widehat{\mm}$ (below), which will reduce later calculations. (b) With the \earlyT module, a feature pair can be discarded from consideration for the \topk before all objects are evaluated (above) with even greater speedup potential after re-ordering the objects (below).}
\vspace{-5mm}
\label{fig:trans_term}
\end{figure}
\begin{example}[Transformation]
Figure~\ref{fig:transform} illustrates the transformation for two features, $f_i$ and $f_j$. The top half depicts $\mm_{i,k}$ and $\mm_{j,k}$ before transformation, where green color represents a positive label and red color represents a negative label. In this example, the centroids of the positive and negative objects are $\mu_{i,j}^+=(5,7)$ and $\mu_{i,j}^-=(3,5)$ respectively. Hence, we can rewrite Equation~\ref{eqn:matrix_transform} as $\widehat{\mm}_{i,k} = (2\mm_{i,k}-8)\cdot l_k$ and $\widehat{\mm}_{j,k} = (2\mm_{j,k}-12)\cdot l_k$ for features $f_i$ and $f_j$ respectively. After calculation, we can obtain the values for $\widehat{\mm}_{i,k}$ and $\widehat{\mm}_{j,k}$ shown in the bottom half of Figure~\ref{fig:transform}.
\end{example}
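The pre-transformation itself can be written in a few lines. The following Python sketch is illustrative only (it assumes $\mm$ has already been restricted to the labeled objects and that the label vector holds $\pm 1$ values); it reproduces the example above when given the centroids $(5,7)$ and $(3,5)$.
\begin{verbatim}
import numpy as np

def pre_transform(M, labels):
    """Fold the per-feature class centroids and the object labels
    into the feature-object matrix, producing M_hat."""
    pos, neg = labels == 1, labels == -1
    mu_p = M[:, pos].mean(axis=1)      # per-feature positive centroids
    mu_n = M[:, neg].mean(axis=1)      # per-feature negative centroids
    w = (mu_p - mu_n)[:, None]
    b = ((mu_p**2 - mu_n**2) / 2)[:, None]
    return (w * M - b) * labels[None, :]

# With the example centroids (5, 7) and (3, 5), row i becomes
# (2*M[i] - 8) * l and row j becomes (2*M[j] - 12) * l.  A pair
# (f_i, f_j) then separates object o_k correctly iff
# M_hat[i, k] + M_hat[j, k] > 0, and a single feature f_i iff
# M_hat[i, k] > 0.
\end{verbatim}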
\sinha{at this point, write a summary of the time complexity of calculating $\theta_{i,j}^{\lhat}$ for a feature pair i,j using transformation (include preprocessing separately)}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%NEED TOGGLE%%%%%%%%%%%%%%%%%%%%%%
% \begin{figure}[h]
% \centering
% \begin{subfigure}{0.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/transformation.pdf}
% \vspace{-5mm}
% \caption{Pre-Transformation}%{$\theta_{i,j}^{\ell}$ with Different Lines}
% \label{fig:transform}
% \end{subfigure}
% \begin{subfigure}{.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/earlyT.pdf}
% \vspace{-5mm}
% \caption{Early Termination ($\tau = 1$)}%{$\theta_{i,j}^{\lhat}$ with Representative Line}
% \label{fig:earlyT}
% \end{subfigure}
% \caption{Different Methods to Reduce Running Time}
% \label{fig:metric}
% \end{figure}
\stitle{Extension to Single Features.} For a feature pair $(f_i,f_j)$ where $i=j$, we can treat the corresponding $\theta_{i,i}$ as the separability score for a single feature $f_i$. Equation~\ref{eqn:s_object_transform} can be further reduced to Equation~\ref{eqn:s_object_transform_single}.
\begin{equation}\label{eqn:s_object_transform_single}
\theta_{i,i}^{\lhat,k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} sign(\widehat{\mm}_{i,k}) = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
Hence, we can calculate the separability score $\theta_{i,i}$ very efficiently for each single feature, based on the transformed matrix $\widehat{\mm}$.
%==================================================================
\subsubsection{Early Termination} \label{ssec:earlyT}
Given a feature pair $(f_i,f_j)$ and its representative line $\lhat$, we need to make a full scan of all objects to compute the separability score $\theta_{i,j}^{\lhat}$ according to Equation~\ref{eqn:s_line}. However, our goal is not to compute the separability score for all feature pairs; instead, we only need to identify the \topk feature pairs. Thus, we propose to reduce the running time using early termination when scanning the objects, denoted as the \earlyT module, as shown in Figure~\ref{fig:workflow}.
\stitle{High Level Idea.} The basic idea is to maintain an upper bound for the separability error $err_{i,j}$ of the \topk feature pairs, denoted as $\tau$. Correspondingly, the lower bound of the separability score can be denoted as ($n-\tau$), where $n$ is the total number of labeled objects in $\widehat{\oo}$. Given a feature pair $(f_i,f_j)$, we scan the object list and stop as soon as the number of incorrectly classified objects exceeds $\tau$. If that happens, we can terminate early and prune this feature pair since it cannot be among the \topk feature pairs. Otherwise, if the number of incorrectly classified objects never exceeds $\tau$ during this scan, then $(f_i,f_j)$ is added to the \topk feature pair set and we update $\tau$ accordingly. Although \earlyT always has the potential to reduce the running time, its benefits are very sensitive to the ordering of the objects for evaluation. Since we terminate as soon as the number of incorrectly classified objects exceeds $\tau$, we can improve our running time if we are able to examine ``problematic'' objects that are unlikely to be correctly classified relatively early.
\stitle{Enhancement by Object Ordering.} To evaluate ``problematic'' objects earlier, we order the objects in descending order of the number of single features $f_i$ that incorrectly classify the object $o_k$, i.e., $\widehat{\mm}_{i,k}\leq0$. Therefore, the first object evaluated is the one that is incorrectly classified by the most single features. We study the impact of this strategy in Section~\ref{sec:exp}.
\begin{example}[\earlyT \& Object Ordering]
Figure~\ref{fig:earlyT} illustrates how early termination and object ordering help reduce the running time. The top half depicts the $\widehat{\mm}$ after transformation. Say the current error upper bound is $\tau=1$. We start scanning the object list from left to right and determine whether each object is correctly separated based on Equation~\ref{eqn:s_object_transform}. Incorrectly classified objects are marked in blue, such as $o_1$, since $\widehat{\mm}_{i,1}+\widehat{\mm}_{j,1}\leq0$. We can terminate after evaluating $o_5$ since the number of incorrectly classified objects exceeds $\tau=1$.
The bottom half of Figure~\ref{fig:earlyT} shows $\widehat{\mm}$ after object reordering, where ``problematic'' objects are placed on the left. In this way, after checking only the first two objects, we can terminate and eliminate this feature pair from further consideration.
\end{example}
\cb{note, 0's are treated as mis-labels and ties are broken randomly.}
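A minimal sketch of the \earlyT scan with object ordering is shown below (illustrative Python over the pre-transformed matrix $\widehat{\mm}$; names are hypothetical and the actual implementation may differ).
\begin{verbatim}
import numpy as np

def problematic_order(M_hat):
    """Order objects so that those misclassified by the most single
    features (M_hat[i, k] <= 0) are evaluated first."""
    return np.argsort(-(M_hat <= 0).sum(axis=0))

def survives_early_termination(M_hat, i, j, order, tau):
    """Scan objects in the given order and prune the pair (f_i, f_j)
    as soon as more than tau objects are incorrectly separated."""
    errors = 0
    for k in order:
        if M_hat[i, k] + M_hat[j, k] <= 0:   # incorrectly separated
            errors += 1
            if errors > tau:
                return False                 # early termination
    return True
\end{verbatim}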
%==================================================================
\subsubsection{Sampling-based Estimation} \label{ssec:sampling}
We proposed the \earlyT module in Section~\ref{ssec:earlyT} to reduce the number of objects that need to be examined, from $n$ to a smaller number. However, there is no guarantee of an improvement in running time, since the termination point is highly data-dependent. In this section, we propose a stochastic method, called \sampling, that further reduces the number of examined objects. Instead of calculating $\theta_{i,j}$ over the whole object set $\widehat{\oo}$, \sampling works on a sample set drawn from $\widehat{\oo}$.
\sinha{use $\theta_{i,j}^{\lhat}$ instead of $\theta_{i,j}$}
\cb{we introduced this simplification at the end of Section \ref{sec:metric}}
\stitle{High Level Idea.} \sampling primarily consists of two phases, called {\em candidate generation} and {\em candidate validation} as shown in Figure~\ref{fig:sampling}. In phase one (Figure~\ref{fig:sampling}(a)), we first estimate the confidence interval of $\theta_{i,j}$ for each feature pair using a sampled set of objects and then generate the candidate feature pairs for full evaluation based on where their confidence intervals lie. If the confidence interval overlaps with the score range of the current \topk, then it is selected for evaluation. In phase two (Figure~\ref{fig:sampling}(b)), we evaluate only the feature pairs in the candidate set, calculating $\theta_{i,j}$ over the whole object set, $\hat{\oo}$, to obtain the final \topk feature pairs. Unlike our previous optimizations, \sampling returns an approximation of the \topk ranking feature pairs. In Section~\ref{sec:exp_comp}, we study how the greatly reduced running time of the \sampling module affects the accuracy of the \topk pairs. %We will describe in detail in Section~\ref{sssec:generate} and \ref{sssec:validate}.
\begin{figure}[t!]
\centering
\vspace{-2mm}
\includegraphics[width=\linewidth]{fig/sampling.pdf}
\vspace{-9mm}
\caption{Phases of \sampling. (a) To calculate the \topthree feature pairs, the confidence interval of $\theta_{i,j}$ is calculated for all feature pairs evaluated on the sample set $\sss$ and a subset of candidates (blue intervals) are selected for validation. (b) The five selected candidates (center box) are evaluated on the whole object set $\hat{\oo}$ to compute the exact $\theta_{i,j}$ and pick the \topthree (rightmost box).}
\vspace{-5mm}
\label{fig:sampling}
\end{figure}
\stitle{Candidate Generation.} %\label{sssec:generate}
Let $\sss$ be a sample set drawn uniformly from the object set $\hat{\oo}$. Given a feature pair $(f_i,f_j)$, let $\theta_{i,j}(\sss)$ be the number of correctly separated objects in $\sss$. Let $\tilde{\theta}_{i,j}$ be the estimated separability score of all objects based on the sample set $\sss$. First, we can estimate $\tilde{\theta}_{i,j}$ from $\theta_{i,j}(\sss)$ using Equation~\ref{eqn:est_score} by assuming the ratio of correctly separated samples in $\sss$ is the same as that in $\hat{\oo}$.
\begin{equation}\label{eqn:est_score}
\tilde{\theta}_{i,j} = \frac{\theta_{i,j}(\sss)}{|\sss|} \cdot n
\end{equation}
Next, we will show that with a constant sample set size $|\sss|$, $|\theta_{i,j}-\tilde{\theta}_{i,j}|$ is bounded by a small value with high probability, using Hoeffding's inequality~\cite{hoeffding1963probability}. We formally quantify the required sample set size in Theorem~\ref{them:est}.
\begin{theorem}[Estimation Accuracy]\label{them:est}
By considering $\Omega(\frac{1}{\epsilon^2}\cdot \log(\frac{1}{\delta}))$ samples, with probability at least $1-\delta$, we have $|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \leq \epsilon$, i.e., $|\tilde{\theta}_{i,j}-\theta_{i,j}|\leq \epsilon n$.
\end{theorem}
\begin{proof}
Based on Hoeffding's inequality:
$$Pr(|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \geq \epsilon) \leq 2e^{-2|\sss|\epsilon^2}$$
Hence, when $|\sss| = \frac{1}{2\epsilon^2}\cdot \log(\frac{2}{\delta}) = \Omega(\frac{1}{\epsilon^2}\cdot \log(\frac{1}{\delta}))$, we have:
$$Pr(|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \geq \epsilon) \leq \delta$$
\qed
\end{proof}
We can treat $\log(1/\delta)$ as a constant, e.g., by setting $\delta = 0.05$. Thus, Theorem~\ref{them:est} essentially states that with only $\Omega(\frac{1}{\epsilon^2})$ samples, with probability 95\%, the confidence interval for $\theta_{i,j}$ is $[\tilde{\theta}_{i,j}-\epsilon n, \tilde{\theta}_{i,j}+\epsilon n]$. Note that the sample size $|\sss|$ only depends on $\epsilon$ and is independent of the number of objects, $n$. Hence, the \sampling module helps \genviz scale to datasets with large $n$.
%For instance, when setting $\epsilon=0.05$, the sample size is fixed to be $400$.
With Theorem~\ref{them:est}, we can first calculate the confidence interval of $\theta_{i,j}$ for each feature pair $(f_i,f_j)$, as shown in Figure~\ref{fig:sampling}(a). Next, we compute the lower bound of $\theta_{i,j}$ for the \topk feature pairs, denoted as $\zeta$, as shown by the red dotted line in Figure~\ref{fig:sampling}(a). Finally, we can prune away feature pairs whose upper bound is smaller than $\zeta$. Accordingly, the candidate set $\cc$ corresponds to the feature pairs depicted by the blue intervals in Figure~\ref{fig:sampling}(a). These feature pairs $\cc$ will be further validated in phase two, i.e., candidate validation. Typically, $|\cc|$ will be orders of magnitude smaller than $m^2$, the original search space for all feature pairs. We will verify this later in our evaluations (Section~\ref{sec:exp}).
\begin{example}[Candidate Generation]
Let us illustrate phase one using Figure~\ref{fig:sampling}(a). Set $\epsilon = 0.05$ and $K=3$. We sample a subset of all objects and compute the confidence interval of $\theta_{i,j}$ for each $(f_i,f_j)$. Next, we obtain the $3^{rd}$ lower bound $\zeta$ (red dotted line) in descending order among all the confidence intervals. Finally, we designate all feature pairs whose upper bound is larger than the red dotted line as blue feature pair candidates $\cc$.
\end{example}
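A sketch of the candidate-generation phase is given below (illustrative Python, assuming the pre-transformed matrix restricted to a uniform sample of the labeled objects; names are hypothetical).
\begin{verbatim}
import numpy as np

def generate_candidates(M_hat_sample, n, pairs, K, eps):
    """Phase one of Sampling: estimate theta_{i,j} for every pair on
    the sample, form confidence intervals of half-width eps*n, and
    keep pairs whose upper bound may still reach the top-K."""
    s = M_hat_sample.shape[1]                  # sample size |S|
    est = np.array([np.sum(M_hat_sample[i] + M_hat_sample[j] > 0) / s * n
                    for (i, j) in pairs])      # estimated scores
    low, up = est - eps * n, est + eps * n
    zeta = np.sort(low)[-K]                    # K-th largest lower bound
    cands = [(p, u) for p, u in zip(pairs, up) if u >= zeta]
    return cands, zeta
\end{verbatim}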
\stitle{Candidate Validation.} %\label{sssec:validate}
Since we only estimated the separability scores $\theta_{i,j}$ for our feature pair candidates in phase one, we re-evaluate all of the candidates generated from phase one to produce our final feature pair ranking. This phase two evaluation is performed using the whole object set $\hat{\oo}$. \sinha{are we using early termination and object ordering in phase two?} \silu{no. We tried both and found that early termination is not helping most of time since 1) the feature pair candidates, which perform well in phase 1, are indeed the ones that are hard to prune and 2) \earlyT incurs some overhead itself.} For each feature pair $(f_i,f_j)\in \cc$, we calculate the separability score $\theta_{i,j}$ and find the \topk feature pairs (Figure~\ref{fig:sampling}(b)).
\stitle{Enhancement by Candidate Ordering.} Recall that in Section~\ref{ssec:earlyT} we proposed an enhancement that allows us to terminate computation by cleverly manipulating the order of the objects; here we similarly found a way to reduce the running time by changing the order in which feature pair candidates are validated in phase two. Instead of directly validating each feature pair candidate, we first order the candidates in descending order according to the upper bound of each candidate's confidence interval. Then, we sequentially calculate the full separability score $\theta_{i,j}$ for each feature pair, and update $\zeta$ correspondingly. Recall that $\zeta$ is the current estimate of the lower bound of $\theta_{i,j}$ for the \topk feature pairs. Finally, we terminate our feature pair validation when the next feature pair's upper bound is smaller than the current value of $\zeta$ (Figure~\ref{fig:candidate_ordering}).
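A corresponding sketch of candidate validation with candidate ordering follows (again illustrative Python; it consumes the candidate list and upper bounds produced in phase one).
\begin{verbatim}
import numpy as np

def validate_candidates(M_hat, cands, K):
    """Phase two of Sampling: evaluate candidates on all objects in
    descending order of their upper bounds, terminating once no
    remaining candidate can still enter the top-K.
    cands: list of ((i, j), upper_bound) pairs."""
    cands = sorted(cands, key=lambda c: -c[1])      # sort by upper bound
    exact, zeta = [], float("-inf")
    for (i, j), upper in cands:
        if len(exact) >= K and upper < zeta:
            break                                   # terminate validation
        theta = int(np.sum(M_hat[i] + M_hat[j] > 0))
        exact.append((theta, (i, j)))
        if len(exact) >= K:
            zeta = sorted(t for t, _ in exact)[-K]  # current top-K lower bound
    return sorted(exact, reverse=True)[:K]
\end{verbatim}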
\begin{figure}[h]
\centering
\vspace{-5mm}
\includegraphics[width=\linewidth]{fig/candidate_ordering.pdf}
\vspace{-6mm}
\caption{Enhancement by Candidate Ordering. (a) Feature pair candidates are sorted by the upper bounds of their confidence intervals (solid red boundary), and the lower bound of the \topthree feature pairs, i.e., $\zeta$, is set (red dotted line). (b,c,d) For each feature pair, we calculate $\theta_{i,j}$ (filled blue circle) using all objects and update $\zeta$ if necessary. Note that $\zeta$ is increased in (d) after evaluating the third feature pair, and since $\zeta$ is now larger than the upper bound of the fourth feature pair, the candidate validation phase can terminate and return the top ranking feature pairs.}
\vspace{-5mm}
\label{fig:candidate_ordering}
\end{figure}
\xsinha{ example descriptions moved to figure legends}
\sinha{ some readers will be confused by higher values being on the left}
%==================================================================
\subsubsection{Search Space Traversal} \label{ssec:traversal}
We have discussed above various optimizations that check fewer than $n$ objects for each feature pair and reduce the number of feature pairs for full evaluation in order to reduce the overall running time. In this section, we propose an optimization to reduce the number of feature pairs considered from $m^2$. With our \traversal module, we propose optimized traversals of the feature pair search space that guide the selection of a reduced number of feature pairs to consider for evaluation.
\stitle{High Level Idea.} The basic idea is that instead of examining each possible feature pair, we only examine a limited number of feature pairs, but in an optimized traversal order. The number of examined feature pairs determines a trade-off between efficiency and accuracy. In general, checking fewer feature pairs results in faster running times, though at the cost of accuracy in recovering the true \topk feature pairs.
The full search space for the feature pairs is the upper triangle in Figure~\ref{fig:traversal}(a), where each row represents $f_i$ and each column represents $f_j$. Let $\chi$ be the limit on the number of feature pairs to be examined. We propose two different traversal orders over the search space: horizontal traversal and vertical traversal, as follows:
\begin{itemize}
\item \emph{Horizontal traversal (Figure~\ref{fig:traversal}(a)):} for each feature $f_i$, pair it with each other feature $f_j$, where $j\geq i$, to obtain $(f_i,f_j)$. Repeat for each $f_i$, where $1 \leq i\leq m$.
\item \emph{Vertical traversal (Figure~\ref{fig:traversal}(b)):} for each feature $f_j$, pair it with each other feature $f_i$, where $i\leq j$, to obtain $(f_i,f_j)$. Repeat for each $f_j$, where $1 \leq j\leq m$.
%\item \emph{Diagonal traversal (Figure~\ref{fig:traversal}(c)):} for a fixed value $v$, pair $(f_i,f_{v-i})$, where $1\leq i<v$. Repeat it for each $v$, where $2 \leq v\leq 2m$.
\end{itemize}
However, the traversal order does not convey any intuition if the feature set $\ff$ is in random order. Hence, we further propose to enhance \traversal by feature ordering in $\ff$.
\stitle{Enhancement by Feature Ordering.} We propose to first sort features based on single features' separability scores $\theta_{i,i}$. Single features with good separability individually are ranked higher. Let $\{f_1^{'}, f_2^{'},\cdots,f_m^{'}\}$ be the sorted single-feature set. Next, we discuss the intuition behind the two different traversal order schemes. Consider partitioning the sorted single-feature list into three categories based on the single-feature separability score $\theta_{i,i}$: {\em good, average,} and {\em poor}.
\begin{itemize}
\item \emph{Horizontal traversal:} the intuition is that there is at least one {\em good} or {\em average} single feature in each \topk feature pair $(f_i,f_j)$.
\item \emph{Vertical traversal:} the intuition is that both features in each \topk feature pair $(f_i,f_j)$ must be categorized as {\em good} or {\em average} individually.
%\item \emph{Diagonal traversal:}
\end{itemize}
We examine the impact of these traversal schemes in our evaluations.
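For concreteness, the two traversal orders over the feature-ordered search space can be written as simple generators that stop after $\chi$ pairs (illustrative Python; feature indices are assumed to already be sorted by single-feature separability).
\begin{verbatim}
def horizontal_traversal(m, chi):
    """Row by row: pair f'_1 with every later feature, then f'_2, ..."""
    count = 0
    for i in range(m):
        for j in range(i, m):
            if count >= chi:
                return
            yield (i, j)
            count += 1

def vertical_traversal(m, chi):
    """Column by column: all pairs within the top-j features before
    moving on to feature f'_{j+1}."""
    count = 0
    for j in range(m):
        for i in range(j + 1):
            if count >= chi:
                return
            yield (i, j)
            count += 1
\end{verbatim}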
%The intuition behind horizontal traversal is that there is at least one {\em top} or {\em median} single features in the \topk feature pairs $(f_i,f_j)$; while the intuition behind horizontal traversal is that both features in \topk feature pairs $(f_i,f_j)$ must rank {\em top} or {\em median} by their own.
\begin{figure}[h]
\centering
% \vspace{-5mm}
\includegraphics[width=0.85\linewidth]{fig/traversal.pdf}
\vspace{-5mm}
\caption{\traversal Orderings. \genviz is able to examine the feature pair search space (a) horizontally or (b) vertically.}
\vspace{-5mm}
\label{fig:traversal}
\end{figure}
%Typically, $K$ is not large, e.g., {\sc Top-1000}, while $m$ is large, e.g., {\sc 20000}.
\begin{example}[\traversal]
Assume $m=2\times 10^4$. Initially, the number of possible feature pairs in the whole search space is roughly $\frac{m^2}{2}=2\times 10^8$. However, if we limit the number of considered feature pairs to $\chi=10^7$, we reduce our search space to $\frac{1}{20}$ of the total number of feature pairs. We order the single features by their individual separability scores. In horizontal traversal, only feature pairs with at least one individual feature ranked in the top 500 will be considered, while vertical traversal will consider only feature pairs with both individual features ranked better than 2000.
%the "worst" $f_i$ ranks around 500, while $f_j$ can be any feature in $\ff$. On the other hand, in vertical traversal, the "worst" $f_i$ (and $f_j$) ranks around 2000.
\end{example}
%==================================================================
\section{Results}
\label{sec:exp}
In this section, we will illustrate that \genviz can rapidly identify meaningful feature pairs for the separability problem in real biological datasets, providing significant and interesting insights. First, we describe the datasets and algorithms used in our evaluations. Each algorithm that we evaluate represents a combination of optimization modules for ranking \topk feature pairs using our Rocchio-based separability measure. We report the running time and accuracy of the selected algorithms. Finally, we enumerate and visualize the \topk feature pairs returned by \genviz, compare them to the corresponding \topk single features, and examine their significance and support in existing publications.
%==================================================================
\subsection{Evaluation Setup}
\stitle{Datasets.} We conducted our tests on two different biological applications. In the first application, which we call \msig, we find pathways and other gene annotations that separate the differentially expressed genes from the undisturbed genes in specific cancer studies. In the other application, called \lincs, we find genes whose expression levels can distinguish experiments in which specific drug treatments were administered from others.
More specifically, in the \msig application, we are given a feature-object matrix with genes as the objects and different gene annotations as the features. To obtain the values in this matrix, we perform a random walk on a heterogeneous network that restarts at the feature gene set annotation node \needcite{DRAWR} and then observe the stationary probabilities of being at each gene node. The heterogeneous network here was constructed with information from BLAST, Gene Ontology, Pfam, and Pathway \needcite{each of these sources}. The values in the matrix correspond to the network connectivity between the node representing the feature gene set annotation and the node representing the object gene in the heterogeneous network of prior biological knowledge.
%The values in this matrix correspond to the network connectivity between the node representing the feature gene set annotation and the node representing the object gene in a heterogeneous network of prior biological knowledge. To calculate these network connectivity scores we perform a random walk on the heterogeneous network that restarts at the feature gene set annotation node \needcite{DRAWR} and then observe the stationary probabilities of being at each gene node. The heterogeneous network here was constructed with information from BLAST, Gene Ontology, Pfam, and Pathway \needcite{each of these sources}.
\cb{still need to be certain on the steps required to create this matrix, also may want to mention restart prob, etc} The positive genes for each dataset in the \msig collection are the set of differentially expressed genes in a specific cancer study, as per the Molecular Signatures Database \needcite{MSIGDB}. Each of our tests searches for pairs of annotations that separate the differentially expressed genes of a cancer study (the ``positive'' set) from all other genes (the ``negative'' set).
In the \lincs application, our feature-object matrix contains values for different genes (features) across many drug treatment experiments (objects) conducted on the MCF7 cell line by the LINCS L1000 project \needcite{lincs}. The values of the matrix are gene expression quantities as reported by the ``level-4'' imputed z-scores measured in the L1000 project. The positive object sets for \lincs are multiple experiments that used the same drug, at varying dosages and for varying durations. Each test attempts to find the top pairs of genes whose expression separates the \lincs experiments relating to a single drug from all other \lincs experiments.
\begin{table}[h]
\centering
%\vspace{-5mm}
\small
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
& $|\ff|$=$m$ & $|\oo|$=$N$ & $|\sss|$ & $\chi$ & \# of $\widehat{\oo}$ & avg($|\oo_+|$) & avg($|\oo_-|$) \\
\hline
\msig & 19,912 & 22,209 & 400 & $10^7$ & 10 & 295 & 21,914 \\
\hline
\lincs & 22,268 & 98,061 & 400 & $10^7$ & 40 & 165 & 97,897 \\
\hline
\end{tabular}
\caption{Dataset Statistics. For each dataset, the number of features $m$, objects $N$, sample size $|\sss|$ used by \sampling module, feature pairs $\chi$ examined by \traversal module, number of object sets: \# of $\widehat{\oo}$, average positive set size: avg($|\oo_+|$), and average negative set size: avg($|\oo_-|$).}
\label{tbl:dataset}
\vspace{-18pt}
\end{table}
\sinha{why did we chose the number of object sets, 10 and 40}
For each of the two applications, the number of features $m$, objects $N$, sample size $|\sss|$, and the examined feature pairs $\chi$ are listed in Table~\ref{tbl:dataset}. We tested on 10 different cancer gene sets for \msig and 40 different drug experimental sets for \lincs, as shown by ``\# of $\widehat{\oo}$'' in Table~\ref{tbl:dataset}. For each positive object set, we treat all of the remaining objects as the negative set and list the average number of positive and negative objects in Table~\ref{tbl:dataset}. Note that the average number of positive objects is far smaller than the average number of negative objects. To balance the contributions of the positive and negative objects, we adjust Equation~\ref{eqn:s_line} to the weighted sum form shown in Equation~\ref{eqn:s_line_weighted}.
\begin{equation}\label{eqn:s_line_weighted}
\hspace{3mm} \theta_{i,j}^{\ell}= \sum_{o_k\in{\oo_-}}{\theta_{i,j}^{\ell,k}}+\frac{|\oo_-|}{|\oo_+|}\cdot\sum_{o_k\in{\oo_+}}{\theta_{i,j}^{\ell,k}} %{\sf Separability \hspace{1mm} Score\hspace{1mm} with\hspace{1mm} Line\hspace{1mm} \ell:}
\end{equation}
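As a minimal illustration, the Python sketch below computes the weighted score of Equation~\ref{eqn:s_line_weighted} for one feature pair and line, assuming the per-object contributions $\theta_{i,j}^{\ell,k}$ (e.g., indicators of whether object $o_k$ lies on the correct side of line $\ell$) have already been computed; the contribution values here are placeholders rather than output of our actual scoring code.
\begin{verbatim}
# Sketch of the weighted separability score above: sum the (assumed
# precomputed) per-object contributions theta[k], with positive objects
# upweighted by |O_-| / |O_+| to balance the two classes.

def weighted_separability(theta, positive_ids, negative_ids):
    weight = len(negative_ids) / len(positive_ids)   # |O_-| / |O_+|
    neg_sum = sum(theta[k] for k in negative_ids)
    pos_sum = sum(theta[k] for k in positive_ids)
    return neg_sum + weight * pos_sum

# Toy usage with 0/1 contributions (1 = object correctly separated):
theta = {0: 1, 1: 1, 2: 1, 3: 0, 4: 1, 5: 1, 6: 1}
print(weighted_separability(theta, positive_ids=[0, 1],
                            negative_ids=[2, 3, 4, 5, 6]))
\end{verbatim}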
%\stitle{Metrics.} We focus on two major metrics in our test: separability score and time. Specifically, each algorithm outputs $k$ feature pairs with highest separability score, i.e., \topk feature pairs, and we call the smallest separability score among the \topk feature pairs as \topk separability score.
%, where $m$ and $n$ are constants depending on the size of the feature-object matrix, $|\sss|$ and $\chi$ are the predetermined sampled set size and limit on the number of feature pairs considered, and $|\cc|$, the number of feature pair candidates is variable and is generated during the algorithm
\begin{table*}[t]
\centering
\small
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
%\hline
\hline
& Transformation & \earlyT & Object Ordering & \sampling & Candidate Ordering & \traversal & Feature Ordering & Complexity\\
\hline
\baseline & yes & no & no & no & no & Any & no & $O(m^2n)$\\
\hline
%\early & yes & yes & no & no & no & Any & no & $O(m^2n)$\\
% \hline
\earlyOrder & yes & yes & yes & no & no & Any & no & $O(m^2n)$\\
\hline
\samp & yes & no & no & yes & no & Any & no & $O(mn+m^2|\sss|+|\cc|n)$\\
\hline
\sampOpt & yes & no & no & yes & yes & Any & no & $O(mn+m^2|\sss|+|\cc|n)$\\
\hline
\horiz & yes & no & no & yes & yes & Horizontal & yes & $O(mn+\chi|\sss|+|\cc|n)$\\
\hline
\vertic & yes & no & no & yes & yes & Vertical & yes & $O(mn+\chi|\sss|+|\cc|n)$ \\
\hline
\end{tabular}
\caption{Selected Algorithms Using Different Optimization Modules. For each algorithm (row), we show which optimization modules are employed and the running time complexity of that combination, where $m$ and $n$ are the number of features and objects, $|\sss|$ is the sample size, $\chi$ is the limit on the number of feature pairs considered, and $|\cc|$ is the number of generated feature pair candidates.}
\label{tbl:alg}
\vspace{-18pt}
\end{table*}
\sinha{maybe naming algorithms A0, A1-A5.}
\sinha{describe table as tree?}
\agp{add approximate column}
\stitle{Algorithms.} In our evaluation, we experimented with six selected combinations of the optimization modules introduced in Section~\ref{sec:opt}, as listed in Table~\ref{tbl:alg}. Other combinations that were not selected were always inferior to the ones shown, e.g., the unlisted algorithm with only the \trans and \earlyT modules always has a worse running time than the selected \earlyOrder combination and outputs \topk feature pairs with the same separability score. For our baseline, \agp{if we are using this for our baseline than I propose we discard its description if we are lacking in space...let's discuss} we use the algorithm with only the feature-object matrix pre-transformation optimization module (\trans). The last column in Table~\ref{tbl:alg} shows the running time complexity of each algorithm. Take \horiz as an example of how the running times are derived. First, the \trans module takes $O(mn)$ time. \cb{ note, scoring single features takes O(mn) time and uses only transformation optimization.} Then, the \traversal module requires sorting the feature set, taking $O(m\log m)$\footnote{Since $m\log m < nm$, we omit the term $m\log m$ from the overall complexity.} time. Finally, with the \sampling module applied over $\chi$ feature pairs, the running time is reduced from $O(m^2n)$ to $O(\chi|\sss|+|\cc|n)$, where the first and second terms represent the time for candidate generation and candidate validation, respectively. Note that $|\cc|$ is typically orders of magnitude smaller than $\chi$ in \horiz, as discussed in Section~\ref{ssec:sampling}.
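To illustrate where these terms come from, the following Python sketch outlines the sample-then-validate structure of the \sampling based algorithms; the scoring function, the slack used for candidate selection, and the data layout are simplified placeholders, not our C++ implementation.
\begin{verbatim}
# Schematic two-phase Sampling pipeline: estimate scores on a small object
# sample to generate candidates (O(chi * |S|)), then validate only the
# surviving candidates on the full data (O(|C| * n)). `score_on(pair, objs)`
# is a placeholder for the Rocchio-based separability score restricted to
# the given object subset.

import heapq
import random

def top_k_pairs(pairs, objects, sample_size, k, score_on, slack=0.0):
    sample = random.sample(objects, sample_size)

    # Phase 1 (candidate generation): score each considered pair on the
    # sample and keep pairs within `slack` of the k-th best estimate.
    estimates = [(score_on(p, sample), p) for p in pairs]
    kth_best = heapq.nlargest(k, (s for s, _ in estimates))[-1]
    candidates = [p for s, p in estimates if s >= kth_best - slack]

    # Phase 2 (candidate validation): re-score the surviving candidates on
    # all objects and return the exact top-k among them.
    validated = [(score_on(p, objects), p) for p in candidates]
    return heapq.nlargest(k, validated)
\end{verbatim}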
\stitle{Evaluation Setup.} We implemented the algorithms in C++ and conducted the evaluations on a machine with 16 CPUs and 61.9 GB of RAM.
%==================================================================
\subsection{Comparison of Different Algorithms}
\label{sec:exp_comp}
In this section, we first show that the Rocchio-based measure is a good proxy for the best possible separability score obtained via brute force, i.e., $\theta_{i,j}$. Then we compare the performance of the selected algorithms in terms of running time and the separability score of the \topthousand feature pairs.
\begin{figure}[h]
\centering %%% not \center
\vspace{-2mm}
\subfigure[Best Feature Pair Comparison]{\label{fig:brute_rocchio_ratio}\includegraphics[width=.235\textwidth]{fig/rocchio_brute_ratio.pdf}}
\subfigure[Measure Comparison]{\label{fig:brute_rocchio_score}\includegraphics[width=.235\textwidth]{fig/rocchio_brute_score.pdf}}
\vspace{-5mm}
\caption{Brute Force vs. Rocchio-based Comparison. (a) For each of 10 runs, we display the ratio of the true separability scores of the best feature pairs chosen by the brute force based and the Rocchio-based measures. (b) For each run, we display the ratio of the true separability score to the Rocchio-based separability score for the best feature pair selected using the Rocchio-based measure.}
\vspace{-5mm}
\label{fig:brute_rocchio}
\end{figure}
\stitle{Brute Force vs. Rocchio-based.} As discussed in Section~\ref{sec:metric}, when using brute force, we need to consider $O(n^2)$ lines in order to find the best separating line $\ell^* \leftarrow \arg_\ell \max \{\theta_{i,j}^{\ell}\}$, leading to a total running time complexity of $O(n^2m^2)$ when considering all feature pairs. An alternative is to use the Rocchio-based representative separating line $L$, dramatically reducing the $O(n^2)$ factor to $O(1)$. Since the brute force method becomes computationally infeasible for datasets with large $n$, we compared the Rocchio-based measure to the brute force based measure using a small $\widehat{\oo}$ for 10 datasets in \msig. The average number of objects, i.e., $|\widehat{\oo}|$, is 295. We call the brute force based separability score the {\em true} separability score, since it examines all possible separating lines. We first find the best feature pair using the Rocchio-based measure and the brute force based measure separately, denoted as $FP_{roc}$ and $FP_{bf}$, and then calculate the ratio of the true separability score of $FP_{roc}$ to the true separability score of $FP_{bf}$ (Figure~\ref{fig:brute_rocchio_ratio}). We observe that the Rocchio-based measure always picks a best feature pair whose true separability score is similar to that of the best pair picked by the brute force based measure, with the score ratio always better than 0.94. Second, for the best feature pair provided by the Rocchio-based measure, we calculate the ratio of the brute force based separability score to the Rocchio-based separability score (Figure~\ref{fig:brute_rocchio_score}). As shown in Figure~\ref{fig:brute_rocchio_score}, the difference between the brute force based and Rocchio-based separability scores is usually small, with the worst case being around 10\% off.
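As a concrete illustration of the measure being compared here, the following Python sketch scores one feature pair with a Rocchio-style nearest-centroid rule, which, in the spirit of the Rocchio classifier, splits the 2-D feature plane with the perpendicular bisector of the two class centroids; this is a simplified stand-in for our actual scoring, provided only to make the comparison concrete.
\begin{verbatim}
# Sketch of a Rocchio-style separability score for one feature pair:
# assign each object to the nearer of the two class centroids in the 2-D
# space (f_i, f_j) and count the correctly separated objects. Nearest-
# centroid assignment corresponds to splitting the plane with the
# perpendicular bisector of the segment joining the centroids.

import numpy as np

def rocchio_separability(x, y, labels):
    """x, y: values of features f_i and f_j; labels: +1 / -1 per object."""
    pts = np.column_stack([np.asarray(x, float), np.asarray(y, float)])
    labels = np.asarray(labels)
    pos_centroid = pts[labels == 1].mean(axis=0)
    neg_centroid = pts[labels == -1].mean(axis=0)

    d_pos = np.linalg.norm(pts - pos_centroid, axis=1)
    d_neg = np.linalg.norm(pts - neg_centroid, axis=1)
    predicted = np.where(d_pos < d_neg, 1, -1)
    return int((predicted == labels).sum())   # correctly separated objects
\end{verbatim}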
\agp{odd to have this here when this section is mostly about algo comparison. maybe bring it up later?}
\begin{figure}[h]
\centering %%% not \center
\vspace{-5mm}
\subfigure[\msig]{\label{fig:msig_time}\includegraphics[width=.235\textwidth]{fig/msig_time.pdf}}
\subfigure[\lincs]{\label{fig:lincs_time}\includegraphics[width=.235\textwidth]{fig/lincs_time.pdf}}
\vspace{-5mm}
\caption{Running Time Comparison. A boxplot for each selected algorithm is shown, with the median running time printed above in matching color. For each boxplot, whiskers are set to $1.5\times$ the interquartile range, outliers are shown as red dots, and the average is marked with a black star.}
\vspace{-5mm}
\label{fig:time}
\end{figure}
\stitle{Running Time.} Figure~\ref{fig:time} depicts the running times of the different selected algorithms under the evaluation setup described above. \xsinha{what are the specs of the machine for the tests}. Each plotted box corresponds to one algorithm, representing the distribution of the running time for finding the \topk feature pairs across all object sets (i.e., 10 gene set collections for \msig and 40 drug treatments for \lincs).
First, let us compare the median running times among the different algorithms. For \msig, \baseline takes more than 2 hours, \earlyOrder takes less than 1 hour, \samp and \sampOpt take around 6 and 5 minutes respectively, while \horiz and \vertic both take only 1 minute on average. Overall, the optimizations reduce the running time by over $180\times$. In the following, we examine the effect of the different modules on the running time. {\em (a.) \earlyT:} we observe that the \earlyT module achieves a $2\times$ speed-up, with the average number of checked genes reduced from $20K$ to $5K$ (Supplementary Table~\ref{tbl:supplement}); {\em (b.) \sampling:} the \sampling module reduces the running time dramatically, with a $20\times$ reduction from \baseline to \sampOpt, since on average only $2M$ candidates are generated from all possible $200M$ feature pairs (Supplementary Table~\ref{tbl:supplement}); {\em (c.) \traversal:} by considering approximately $\frac{1}{20}$ of all possible feature pairs, \horiz and \vertic achieve an additional $6\times$ speed-up over \sampOpt. This speed-up of \horiz and \vertic is limited by the feature ordering overhead (around $6s$) and the time for the \trans module (around $8s$). In addition, during the candidate generation phase, \sampOpt is able to prune a greater percentage of feature pairs than \horiz and \vertic: since \horiz and \vertic restrict their search to pairs involving highly ranked single features, the feature pairs they consider have higher separability scores on average than those in \sampOpt, and fewer of them can be pruned.
\sinha{previous sentence is not clear}
Next, let us look at the log-scale interquartile range (IQR) of the running times for the different selected algorithms in Figure~\ref{fig:time}. We observe that \earlyOrder has the largest interquartile range, indicating that the \earlyT module, which tries to reduce the number of objects evaluated for each feature pair, is highly dependent on the object set and feature values. As we discussed in Section~\ref{ssec:earlyT}, \earlyT offers no guarantee of improving the running time. In fact, the algorithm can occasionally be slower than \baseline, as shown in Figure~\ref{fig:lincs_time}, because \earlyT incurs additional overhead for checking the pruning and early termination criteria when scanning the object list for each feature pair.
\cb{we need more clarity (maybe in the discussion) why the early termination is included if it is not performing well, i.e. cases where we would recommend using it }
Similar results can be found in Figure~\ref{fig:lincs_time} for \lincs. In this setting, our optimization modules yield over a $400\times$ average decrease in the running time of finding the \topk feature pairs that separate the \lincs experiments of a single perturbagen from the others. The greatest speed-up comes from adding the \sampling module: only $100K$ feature pair candidates, i.e., $|\cc|$, are validated out of all $250M$ feature pairs (Supplementary Table~\ref{tbl:supplement}). For the fastest selected algorithms, \horiz and \vertic, the pre-transformation and feature ordering overheads account for an average of $45+35=80$ seconds of the median running times of 104 and 94 seconds, respectively.
% (b) the pruning of feature pairs in candidate generation phase is more powerful in \sampOpt, since on average the feature pairs considered in \horiz and \vertic is better than that in \sampOpt intuitively.
\begin{figure}[h]
\centering %%% not \center
\vspace{-5mm}
\subfigure[\msig]{\label{fig:msig_separability}\includegraphics[width=.235\textwidth]{fig/msig_accuracy.pdf}}
\subfigure[\lincs]{\label{fig:lincs_separability}\includegraphics[width=.235\textwidth]{fig/lincs_accuracy.pdf}}
\vspace{-5mm}
\caption{Separability Quality Comparison. Boxplots in the style of Figure~\ref{fig:time}, showing for each method how many of its returned feature pairs appear among the 100 best feature pairs of the \baseline method.}
\vspace{-5mm}
\label{fig:separability}
\end{figure}
\sinha{alignment of algorithm names not ok in figure}
\stitle{Separability Quality.} Recall that the \earlyT module is a deterministic optimization and will produce the same \topk feature pairs as the baseline method. The \sampling module, on the other hand, is stochastic: it places a probabilistic guarantee on the separability score of the feature pairs and therefore can only provide an approximate estimate of the \topk feature pair ranking. Finally, the \traversal module is heuristic and may output \topk feature pairs that are very different from the ranking produced by the \baseline algorithm. Since \baseline returns the true Rocchio-based separability score of each feature pair, we measured the quality of each selected algorithm by counting the number of feature pairs shared between its \tophundred and the \tophundred of \baseline. Figure~\ref{fig:separability} shows this separability quality comparison. Let us first focus on the collection of \msig experiments. \earlyOrder, as expected, has exactly the same separability quality as \baseline. We also observe that the \samp and \sampOpt rankings are nearly identical to the \tophundred feature pairs of \baseline, thanks to the probabilistic guarantee stated in Theorem~\ref{them:est}. The \horiz and \vertic algorithms output a median of 92 and 48 feature pairs in common with \baseline, respectively, because of the heuristic \traversal module. In the \msig results, \horiz performs much better than \vertic, with a much higher median and a much narrower interquartile range, as shown in Figure~\ref{fig:msig_separability}. This suggests, as we hypothesized, that interesting separating feature pairs exist beyond the combinations of top-ranked single features to which \vertic is restricted. We repeated this quality analysis for the \lincs experiments and found that the \sampling based algorithms returned identical \tophundred feature pairs for all 40 runs. The quality of the \traversal based algorithms was again lower, but for the \lincs experiments there was not such a large separation between the performance of the \horiz and \vertic algorithms.
\stitle{Takeaways.} If accuracy is paramount to the user, \sampOpt is the suggested algorithm; if running time is paramount, \horiz is the preferred choice.
%The p-value from Fisher's exact test represents the significance of deviation from the null hypothesis, i.e., the predicted label is independent of the real label. This significance test is frequently used to characterize gene sets \needcite{}.
%==================================================================
\subsection{Feature Pair vs. Single Feature.}\label{sec:FPvSF}
In this section, we quantify how statistically significant the top ranking results of the selected algorithms are. We show that we are often able to find separating feature pairs that are more significant than the best single separating feature. To assess the significance of a separating feature or feature pair, we first calculate the p-value using the one-sided Fisher's exact test on a $2\times2$ contingency table. This contingency table is constructed with the rows being the true positive and negative labels, the columns being the predicted positive and negative labels, and the values being the number of objects that belong to each table cell. Using the Fisher's exact test p-value, we assert that feature pairs can provide better separability performance or new insights compared to single features: (a) {\em better:} feature pairs have improved p-values compared to the corresponding individual features, even after the appropriate multiple hypothesis correction, and (b) {\em new:} there exist high-ranked feature pairs that are poorly ranked as individual single features.
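For concreteness, the Python sketch below builds the $2\times2$ contingency table and computes the one-sided Fisher's exact test p-value with SciPy; the true and predicted label arrays are assumed inputs (the predictions would come from the separating line of the feature or feature pair being assessed).
\begin{verbatim}
# Sketch: significance of a separating feature (or feature pair) via a
# one-sided Fisher's exact test on the 2x2 table of true vs. predicted
# labels. `true` and `predicted` are boolean arrays (True = positive class).

import numpy as np
from scipy.stats import fisher_exact

def separation_pvalue(true, predicted):
    true = np.asarray(true, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    table = [[np.sum(true & predicted),  np.sum(true & ~predicted)],
             [np.sum(~true & predicted), np.sum(~true & ~predicted)]]
    _, pval = fisher_exact(table, alternative="greater")
    return pval

# The Bonferroni-style corrections discussed below (m * n hypotheses for
# single features, m^2 * n^2 for feature pairs) are applied to this p-value.
\end{verbatim}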
\begin{figure}[h]
\centering %%% not \center
\vspace{-5mm}
\subfigure[\msig]{\label{fig:msig_singleF}\includegraphics[width=.16\textwidth]{fig/msig_singleF.pdf}}
\subfigure[\lincs]{\label{fig:lincs_singleF}\includegraphics[width=.31\textwidth]{fig/lincs_singleF.pdf}}
\vspace{-5mm}
\caption{Single Feature Corrected P-value Distribution. For each experiment (x-axis), we show the significance of the \tophundred best single features (blue dots) for the (a) \msig and (b) \lincs datasets. We order the runs by their best corrected single feature p-value and omit the runs with no single feature whose corrected p-value is better than $10^{-5}$.}
\vspace{-5mm}
\label{fig:singleF}
\end{figure}
\stitle{Single Feature.} Finding \topk single features is a special case of finding feature pairs by setting $i=j$, as remarked in Section~\ref{ssec:trans} and Equation~\ref{eqn:s_object_transform_single}. For each single feature obtained, we compute the p-value with Fisher's exact test, denoted as $pval$. Next, we define the Bonferroni corrected p-value as $corrected\_pval= pval\times m \times n$, since there are $m \times n$ possible hypotheses, one for each possible single feature and separating line. We say a selected feature is significant if the corrected p-value is smaller than the threshold $10^{-5}$, i.e., $-\log_{10} (corrected\_pval)> 5$. In Figure~\ref{fig:singleF}, we plot the distribution of the corrected p-values of the \tophundred features reported for each dataset in \msig and \lincs. We observe that 10 out of 10 runs in \msig and 32 out of 40 runs in \lincs have significant single features. We will use this ordering and focus on these 10 runs in \msig and 32 runs in \lincs for the remainder of the paper. We observe very small p-values, $\leq 10^{-50}$, in the left parts of Figures~\ref{fig:msig_singleF} and \ref{fig:lincs_singleF}, indicating a very strong relationship between the predicted and true object labels. This shows the effectiveness of linear separability and the Rocchio-based separability metric in identifying features that predict the different classes of objects.
%
%This shows the effectiveness of our Rocchio-based
\begin{figure}[h]
\centering %%% not \center
\vspace{-5mm}
\subfigure[\msig]{\label{fig:msig_FP}\includegraphics[width=.16\textwidth]{fig/msig_FP.pdf}}
\subfigure[\lincs]{\label{fig:lincs_FP}\includegraphics[width=.31\textwidth]{fig/lincs_FP.pdf}}
\vspace{-5mm}
\caption{Feature Pairs' Corrected P-value Distribution. Plotted significance for the \tophundred best feature pairs (blue dots) for each experiment in the (a) \msig and (b) \lincs datasets.}
\vspace{-5mm}
\label{fig:FP}
\end{figure}
\stitle{Feature Pair.} We next build the contingency tables and calculate the p-value for the \topk feature pairs. To correct for the $m^2$ possible feature pairs and the $n^2$ possible ways to choose the separating lines for each feature pair, we apply a Bonferroni p-value correction to produce $corrected\_pval = pval\times m^2 \times n^2$. We plot the distribution of these corrected p-values for the \topk feature pairs in Figure~\ref{fig:FP}. Once again, the threshold for defining a significant feature pair is set to $10^{-5}$. We find that 10 out of 10 runs in \msig and 27 out of 40 runs in \lincs have significant feature pairs by this metric. Comparing the \tophundred single features (Figure~\ref{fig:singleF}) to the \tophundred feature pairs (Figure~\ref{fig:FP}) per dataset, we observe that the corrected p-values of the feature pairs are often more significant and more tightly\sinha{this is not necessarily convincing. Could be fp's that share a feature.} distributed than the single feature corrected p-values. This indicates that our algorithm finds many feature pairs with better separability quality than the top ranking single features, even after accounting for the larger search space. In the following, we further illustrate that feature pairs also provide better and new insights compared to single features.
\begin{figure}[h]
\centering %%% not \center
\vspace{-3mm}
\subfigure[\msig]{\label{fig:msig_histogram_diff}\includegraphics[width=.235\textwidth]{fig/histogram_msig_diff_pval.pdf}}
\subfigure[\lincs]{\label{fig:lincs_histogram_diff}\includegraphics[width=.235\textwidth]{fig/histogram_lincs_diff_pval.pdf}}
\vspace{-5mm}
\caption{Histogram of $improv\_pval$. For the \toptwenty feature pairs from all runs from the (a) \msig and (b) \lincs datasets, distribution of the improvement of the feature pair significance over the corresponding single feature significance. The red line shows the significance threshold of 5.}
\vspace{-5mm}
\label{fig:histogram_diff}
\end{figure}
%Remember that the correction factors for single feature and feature pair are different: $mn$ vs. $m^2n^2$.
\stitle{Improvement from Single Feature to Feature Pair.} We have computed the corrected p-value $corrected\_pval$ for each single feature and feature pair. Now let us examine the improvement of each feature pair over its two corresponding single features in terms of p-value. For each feature pair $(f_i,f_j)$, we define the improved p-value as the ratio between the corrected p-value of $(f_i,f_j)$ and the better of the corrected p-values of $f_i$ and $f_j$, i.e., $improv\_pval = \frac{corrected\_pval(f_i,f_j)}{\min(corrected\_pval(f_i),corrected\_pval(f_j))}$. In this set of results, we focus on the \toptwenty feature pairs for each dataset to improve the clarity of the presentation. In Figure~\ref{fig:histogram_diff}, we plot the histogram of $improv\_pval$ for the 10 runs in \msig and the 32 runs in \lincs. In both datasets, there are many feature pairs (to the right of the red dotted line) in the top 20 that are more significant than their corresponding single features, e.g., on average 8 of the \toptwenty feature pairs have $-\log_{10} improv\_pval > 5$ for \lincs. This shows that some feature pairs offer a substantial improvement over single features in terms of separability significance. \sinha{note here about possibility of redundant feature pairs contributing to the histogram.}
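Because the corrected p-values can be extremely small, $improv\_pval$ is most conveniently computed directly from the $-\log_{10}$ corrected p-values plotted in our figures; a minimal sketch with hypothetical values follows.
\begin{verbatim}
# Sketch: -log10(improv_pval) computed directly from -log10 corrected
# p-values, i.e., improv_pval = corrected_pval(fi, fj) /
# min(corrected_pval(fi), corrected_pval(fj)). Working in log space keeps
# very small p-values representable.

def neg_log10_improv_pval(pair_neg_log10, f1_neg_log10, f2_neg_log10):
    return pair_neg_log10 - max(f1_neg_log10, f2_neg_log10)

# Hypothetical example: pair corrected p-value 1e-60, best single feature
# 1e-52  ->  -log10(improv_pval) = 8, above the significance threshold of 5.
print(neg_log10_improv_pval(60, 52, 40))
\end{verbatim}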
\stitle{New Insight from Feature Pairs.} To assess the quality of the top ranking feature pairs, we focused on the \lincs dataset, where the objects are experimental treatments of the MCF7 breast cancer cell line with the same {\em drug} compound and the features are expression values of different {\em genes}. For the evaluations above, we used object sets for the 40 small molecules with the largest number of LINCS experiments. For the following analysis, we refine our list to drugs that are commonly administered and have at least 60 LINCS experiments on the MCF7 cell line. These nine drugs are vorinostat, trichostatin, estradiol, tamoxifen, doxorubicin, gemcitabine, daunorubicin, idarubicin, and pravastatin. For each chosen drug, we ran our \sampOpt algorithm to rank the \topthousand feature gene pairs for separating the LINCS experiments of the drug from all other MCF7 experiments.
For all drugs except pravastatin, all of the \topthousand ranked feature pairs were found to be significant ($-\log_{10} corrected\_pval > 5$). The average log significance of the \topthousand feature pairs correlated with the number of experiments ($r=0.91$) in the drug's positive object set (see Table~\ref{tbl:selected_fps}), suggesting the \sampOpt algorithm requires a minimum number of example objects to learn a meaningful separator between the two classes. As described in Section~\ref{sec:FPvSF}, we are especially interested in feature pairs whose corrected p-value is better than the corrected p-values of their corresponding single features ($-\log_{10} improv\_pval > 0$). We found 1070 feature pairs with improved separability over their single features among the top 1000 pairs of these evaluation drug sets. One drug, trichostatin, had especially strong performing single features and showed no feature pairs that significantly improved on them. However, for the remaining seven drugs, there were as few as nine significant, improved feature pairs for tamoxifen and as many as 369 pairs for doxorubicin (Table~\ref{tbl:selected_fps}).
\begin{table}[t!]
\centering
\small
\begin{tabular}{|l|c|c|c|c|}
\hline
Drug & NumExprs & Avg Signif & Top1000 Signif & Top1000 Improved\\
\hline
vorinostat & 904 & 235.5 & 1000 & 287\\
\hline
trichostatin & 689 & 277.1 & 1000 & 0\\
\hline
estradiol & 325 & 166.8 & 1000 & 203\\
\hline
tamoxifen & 122 & 105.8 & 1000 & 9\\
\hline
doxorubicin & 104 & 28.0 & 1000 & 369\\
\hline
gemcitabine & 97 & 52.5 & 1000 & 116\\
\hline
daunorubicin & 91 & 40.9 & 1000 & 28\\
\hline
idarubicin & 78 & 30.1 & 1000 & 58\\
\hline
pravastatin & 61 & -7.5 & 0 & 0\\
\hline
Grand Total & & 43.1 & 9068 & 1070\\
\hline
\end{tabular}
\caption{For each chosen drug from \lincs, the number of experiments in the MCF7 cell line that were performed with that drug (NumExprs), and statistics for the top 1000 feature pairs for that drug, including the average $-\log_{10} corrected\_pval$ (Avg Signif), the number of feature pairs with $-\log_{10} corrected\_pval > 5$ (Top1000 Signif), and the number with $-\log_{10} improv\_pval > 0$ (Top1000 Improved).}
\label{tbl:selected_fps}
\vspace{-18pt}
\end{table}
Many of these 1070 significantly improved feature pairs are redundant, containing one of the best-ranked single feature genes as a member of the feature gene pair. An example of this is the object set of LINCS experiments for the drug estradiol. We found the gene PRSS23 to be the single feature with the highest separability, and many feature pairs containing PRSS23 and a second gene to have an improved corrected p-value, for example (PRSS23, RAP1GAP), (PRSS23, TSC22D3), and (PRSS23, BAMBI). We looked for evidence of the relationship between the drug estradiol and these feature pair genes in the Comparative Toxicogenomics Database (CTD) \needcite{CTD} and with our own literature survey. From this search, we found evidence for the pronounced effect of estradiol in increasing expression levels of PRSS23, RAP1GAP, and BAMBI, and decreasing expression of TSC22D3 \needcite{CTD pubmeds}. So although the top single feature recurred in multiple top feature pairs, each secondary feature gene was also meaningfully related to the administered drug.
We next examined these top improved feature pairs from \lincs to determine their consistency with existing biological knowledge. For our biological knowledge sets, we downloaded gene interaction datasets derived from the databases of STRING, Reactome, Pathway Commons, HumanNet, BioGRID, IntAct, DIP, and BLAST \needcite{} \cb{may want supplementary methods/tables?}. These downloaded interaction networks covered 23,167 genes and had at least one known interaction between $2.17\%$ of all possible gene pairs. Of the 996 unique feature pairs with significant $improv\_pval$ where both genes mapped onto the genes covered by the interaction networks, 133 gene pairs ($13.4\%$) were found to have at least one known interaction. This six-fold enrichment demonstrates that \genviz finds feature pairs of genes that have a known relationship more often than expected by chance. One example of such a feature pair is (GLRX, NME7), whose expression levels are especially good features for separating vorinostat experiments from all others. Not only are both of these genes known to have increased mRNA expression in response to vorinostat \needcite{CTD pubmeds}, but the two genes are also annotated by STRING as belonging to database pathways of nucleotide biosynthesis, co-expressing with each other in other model organisms, and being mentioned together often in literature abstracts. \cb{link to STRING interface: https://string-db.org/cgi/network.pl?taskId=oggKTi2cKPAc} Furthermore, in Section~\ref{ssec:viz} we will demonstrate that the positive objects and negative objects are visually separated under this feature pair, as shown in Figure~\ref{fig:viz}.
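As a rough sanity check on this enrichment (not a computation reported above), the fold enrichment and a tail probability can be derived from the stated counts; the sketch below treats the $2.17\%$ background rate as a fixed binomial success probability, which is a simplifying assumption.
\begin{verbatim}
# Sketch: fold enrichment of known interactions among the improved feature
# pairs, with a binomial tail probability as a rough significance check.
# Uses the counts reported in this section; the 2.17% background rate is
# treated as a fixed per-pair probability (a simplifying assumption).

from scipy.stats import binom

observed, tested, background_rate = 133, 996, 0.0217
fold_enrichment = (observed / tested) / background_rate      # about 6.2x
pval = binom.sf(observed - 1, tested, background_rate)       # P(X >= 133)
print(f"fold enrichment = {fold_enrichment:.1f}, p = {pval:.1e}")
\end{verbatim}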
In Supplementary Table \cb{in google spreadsheet https://goo.gl/Z5jJ6Q for now}, we examine several top improved feature gene pairs from our \sampOpt runs on the selected drug experiment collections. Of the twenty-nine feature pairs in this table, six have all three types of accompanying evidence: 1) a literature relationship between the drug and the first gene, 2) a literature relationship between the drug and the second gene, and 3) an interaction network relationship between the pair of genes. Thirteen have just two of the three types of evidence, and only eight have no evidence at all \cb{this may improve when I finish additional manual search}. Particularly interesting are the top improved feature pairs in which neither of the single gene features ranked well alone. This is especially pronounced for the gene pair CDKN1A and CEBPB for separating doxorubicin experiments. Neither gene feature alone is within the top 600 genes for separating doxorubicin experiments from others. However, the combination of the pair is the second most improved feature pair for doxorubicin. This feature pair also has all three types of accompanying evidence; doxorubicin is known to increase expression of CDKN1A \needcite{CTD pubmeds} and CEBPB \needcite{CTD pubmeds}, and the pair of genes is annotated in STRING with some evidence for co-expression and text mining relationships. \cb{STRING link https://string-db.org/cgi/network.pl?taskId=6L40K8DVTkDJ} This feature pair can be used to form an interesting hypothesis for further analysis or experimentation. The potential for finding more significant and previously unidentified features is the reason we designed \genviz to recover top ranking feature pairs instead of just single features.
\cb{I did not insert a substitute for this figure~\ref{fig:better_rank_20} (no longer referred to in the text and to be deleted). It may not be as necessary with some of the examples in the supplementary table.}
\begin{figure}[h]
\vspace{-5mm}
\centering
\includegraphics[width=0.8\linewidth]{fig/better_rank_20.pdf}
\vspace{-5mm}
\caption{Comparison of Feature Pair and Single Feature Rankings. The \toptwenty feature pairs from the \msig (blue) and \lincs (red) datasets are depicted, with each circle representing a feature pair. The x-axis is the rank of the feature pair among all feature pair candidates, and the y-axis is the better rank of its two individual single features among all single features. The feature pair with the green circle is highlighted in the text.}
\vspace{-5mm}
\label{fig:better_rank_20}
\end{figure}
\sinha{why overlap the \msig and \lincs? suggests a simultaneous examination, which is not what we want here} \agp{yes}
\begin{figure}[h]
\centering %%% not \center
\vspace{-3mm}
\subfigure[\msig]{\label{fig:viz_msig}\includegraphics[width=.235\textwidth]{fig/ratioHist_msig_1.png}}
\subfigure[\lincs]{\label{fig:viz_lincs}\includegraphics[width=.235\textwidth]{fig/ratioHist_newlincs_1.png}}
\vspace{-5mm}
\caption{Visualization Output of \genviz. Heatmap visualization with the pair of top features providing the x and y axes and the name of the run providing the plot title. The relative density of objects determines the color of the heatmap cells with blue indicating a greater proportion of positive objects and red indicating a greater proportion of negative objects. The class centroids are represented by blue and red circles, positive and negative classes respectively. The two examples shown have the most $improv\_pval$ for the (a) \msig and (b) \lincs datasets. }
\vspace{-5mm}
\label{fig:viz}
\end{figure}
%==================================================================
\subsection{Output Visualization}\label{ssec:viz}
As discussed in Section~\ref{sec:intro}, the output of \genviz is not simply a ranking of the top feature pairs with their separability scores, but also a visualization that helps users interpret the separability of the object classes. In Figure~\ref{fig:viz}, we depict sample output visualizations from the \msig and \lincs runs. We pick the feature pair with the highest improved p-value, i.e., $improv\_pval$, using \sampOpt in \msig, and the feature pair (GLRX, NME7) in \lincs as described in the ``new insight'' section above. We use a heatmap to illustrate the distribution of positive objects (in blue) and negative objects (in red) under the selected feature pair $(f_i,f_j)$, where $f_i$ is on the x-axis and $f_j$ is on the y-axis. The color represents the difference between the weighted numbers of positive and negative objects in each cell: if the weighted number of positive objects exceeds that of negative objects, the cell is colored blue, otherwise red. For ease of interpretation, we also add the centroids of the positive and negative objects as blue and red circles, respectively. For \msig, we observe that the stationary probabilities of the negative objects cluster around zero, while the positive objects have higher stationary probabilities overall. For \lincs, the positive objects all have one gene probe up and one probe down, quite different from the negative objects, which follow a roughly zero-mean Gaussian distribution. \cb{panel a relates to (DELYS\_THYROID\_CANCER\_DN, response to reactive oxygen species, cell adhesion) which I cannot really appreciate other than the fact that cancer gene sets often relate to cell adhesion. panel b relates to (Vorinostat, TRAK2, KEAP1) and there is a paper that shows that vorinostat with other drugs induces upregulation of KEAP1, however I found nothing related to TRAK2.} Hence, the output visualization helps users explain the separability and better understand the data.
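A minimal Python sketch of this style of heatmap follows; the binning, color scale, and class weighting are simplified stand-ins for the actual \genviz front end.
\begin{verbatim}
# Sketch of the heatmap output: bin the objects over the (f_i, f_j) plane,
# color each cell by the weighted difference between positive and negative
# counts (blue = more positives, red = more negatives), and overlay the two
# class centroids. A simplified stand-in for the GenViz visualization.

import numpy as np
import matplotlib.pyplot as plt

def plot_separability_heatmap(x, y, labels, bins=40, title=""):
    x, y = np.asarray(x, float), np.asarray(y, float)
    labels = np.asarray(labels)
    pos, neg = labels == 1, labels == -1
    rng = [[x.min(), x.max()], [y.min(), y.max()]]

    pos_hist, xe, ye = np.histogram2d(x[pos], y[pos], bins=bins, range=rng)
    neg_hist, _, _ = np.histogram2d(x[neg], y[neg], bins=bins, range=rng)

    # Upweight positives so the two classes contribute comparably.
    diff = (neg.sum() / pos.sum()) * pos_hist - neg_hist
    plt.imshow(diff.T, origin="lower", cmap="coolwarm_r", aspect="auto",
               extent=[xe[0], xe[-1], ye[0], ye[-1]])
    plt.scatter(x[pos].mean(), y[pos].mean(), c="blue")   # positive centroid
    plt.scatter(x[neg].mean(), y[neg].mean(), c="red")    # negative centroid
    plt.xlabel("feature f_i"); plt.ylabel("feature f_j"); plt.title(title)
    plt.show()
\end{verbatim}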
\cb{SS would like us to figure out how to make a simple interface for this. so we need a design of configure page, design of results page, and examples how to call backend code}
\cb{for the discussion: non-linear features?}
\cb{Low priority, but I am interested in knowing for all the feature pairs we return that are better/new. what type of relationship between the pair are we finding. maybe captured by the slope of the perp bisector?}
\cb{again low priority, but how robust are these results. If you run a sampling method twice, do you get the same feature pairs. if you change the positive set slightly, are your results comparable?}
\appendix
\section{Appendix}\label{sec:supplement}
\begin{table*}[t]
\centering
\small
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
& & \multirow{ 2}{*}{\baseline} & \multirow{ 2}{*}{\earlyOrder} & \multicolumn{2}{c|}{\samp}
& \multicolumn{2}{c|}{\sampOpt} & \multicolumn{2}{c|}{\horiz} & \multicolumn{2}{c|}{\vertic} \\
\cline{5-12}
& & & & Phase 1 & Phase 2
& Phase 1 & Phase 2 & Phase 1 & Phase 2 & Phase 1 & Phase 2 \\
\hline
\multirow{ 2}{*}{\msig} & FPs Checked & $2\times 10^8$ & $2\times 10^8$ & $2\times 10^8$ & $4.1\times 10^6$ & $1.9 \times 10^8$ & $2.3\times 10^6$ & $10^7$ & $1.2\times 10^6$ & $10^7$ & $5.5\times 10^5$ \\
\cline{2-12}
& Objects Checked & 22209 & 5427 & 694 & 22209 & 694 & 22209 & 694 & 22209 & 694 & 22209 \\
\hline
\multirow{ 2}{*}{\lincs} & FPs Checked & $2.5\times 10^8$ & $2.5\times 10^8$ & $2.5\times 10^8$ & $3.3\times 10^5$ & $2.5\times 10^8$ & $9.1\times 10^4$ & $10^7$ & $8.6\times 10^4$ & $10^7$ & $5.0\times 10^4$\\
\cline{2-12}
& Objects Checked & 98061 & 15631 & 564 & 98061 & 564 & 98061 & 564 & 98061 & 564 & 98061 \\
\hline
\end{tabular}
\caption{Supplementary Table for the Running Time Comparison in Figure~\ref{fig:time}. FPs stands for feature pairs; Phase 1 and Phase 2 refer to the candidate generation and candidate validation phases of the \sampling based algorithms.}
\label{tbl:supplement}
\end{table*}