Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India

UPCOMING SEMINARS

Series: Department Seminar
Title: The Construction of Semidefinite Representation of Convex Set from its Projections

  • Speaker: Dr. Anusuya Ghosh
                   Post-doctoral fellow in IIM Bangalore
  • Date and Time: Tuesday, May 02, 2017, 10:00 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
Methods for constructing semidefinite representations of convex sets play a major role in recent research on modern convex optimization. Even when a convex set is known to be semidefinite representable, there is no general procedure for constructing its semidefinite representation. This work presents a construction method for the semidefinite representation of a convex body when its projections onto lower-dimensional spaces are known. We propose an algorithm that produces an approximate solution for this construction, and we further show that the approximate solution converges to the convex set. We also give a detailed comparison of our construction method with other methods available in the recent literature.

Speaker Bio:
Anusuya Ghosh is currently working as a post-doctoral fellow at IIM Bangalore. She received her PhD in Industrial Engineering and Operations Research from IIT Bombay. She also received the "Best Project Award" from the Department of Mathematics at IIT Kharagpur in 2009, during her M.Sc.

Host Faculty: Chiranjib Bhattacharyya


Series: M.Sc. (Engg) Colloquium
Title: A Study of Thompson Sampling Approach for Sleeping Multi-Armed Bandit Problems

  • Speaker: Mr. Aritra Chatterjee
                   M.Sc. (Engg) Student, CSA, IISc
  • Faculty Advisor: Prof. Y. Narahari
  • Date and Time: Friday, April 28, 2017, 4:00 PM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
The multi-armed bandit (MAB) problem provides a convenient abstraction for many online learning problems arising in modern applications including Internet advertising, crowdsourcing, online procurement, smart grids, etc. Several variants of the MAB problem have been proposed to extend the basic model to a variety of practical and general settings. The sleeping multi-armed bandit (S-MAB) problem is one such variant, where the subset of available arms varies with time. This study focuses on analyzing the efficacy of the Thompson Sampling algorithm for solving the S-MAB problem.

Any algorithm for the classical MAB problem is expected to choose one of K available arms (actions) in each of T consecutive rounds. Each choice of an arm generates a stochastic reward from an unknown but fixed distribution. The goal of the algorithm is to maximize the expected sum of rewards over the T rounds (or, equivalently, to minimize the expected total regret) relative to the best fixed action in hindsight. In many of these settings, however, not all arms may be available in any given round. For example, in Internet advertising, some advertisers might stay away from the auction due to budget constraints; in crowdsourcing, some workers may not be available at a given time due to timezone differences. This implies that the algorithm must work only with the set of available arms. Further, unlike in the classical setting, the performance of an algorithm can no longer be evaluated relative to the best fixed arm in hindsight; the regret is now measured relative to the best "available" arm in hindsight. One possible way is to compute the overall regret as the worst-case regret over all possible sample paths of availabilities (that is, under adversarial selection of available arms).

In the literature, upper confidence bound (UCB)-based approaches have been proposed and investigated for the S-MAB problem. Our contribution is to investigate a Thompson Sampling based approach. Our key result is a logarithmic regret bound that non-trivially generalizes a similar bound known for this approach in the classical MAB setting. Our bound also matches (up to constants) the best-known lower bound for the S-MAB problem. Further, we show via detailed simulations that the Thompson Sampling approach in fact outperforms the known algorithms for the S-MAB problem.
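As a concrete illustration of the sampling idea, here is a minimal Beta-Bernoulli Thompson Sampling loop restricted to the arms available in each round. This is only a generic sketch, not the algorithm analyzed in the talk; the function and parameter names are illustrative.

```python
import random

def thompson_sampling_sleeping(T, arms, reward_fn, available_fn):
    """Beta-Bernoulli Thompson Sampling restricted to the arms available
    in each round. reward_fn(a) returns a 0/1 reward for arm a;
    available_fn(t) returns the arms awake in round t. (Illustrative API.)"""
    alpha = {a: 1 for a in arms}  # Beta(1, 1) prior: successes + 1
    beta = {a: 1 for a in arms}   # failures + 1
    total_reward = 0
    for t in range(T):
        awake = available_fn(t)
        # sample a plausible mean for each awake arm; play the argmax
        chosen = max(awake, key=lambda a: random.betavariate(alpha[a], beta[a]))
        r = reward_fn(chosen)
        total_reward += r
        if r:
            alpha[chosen] += 1
        else:
            beta[chosen] += 1
    return total_reward
```

The only change from classical Thompson Sampling is that the argmax ranges over the awake arms rather than all arms.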


PAST SEMINARS

Series: Ph.D. Colloquium
Title: New Methods for Learning from Heterogeneous and Strategic Agents

  • Speaker: Divya Padmanabhan (PhD student)
  • Faculty Advisor: Prof. Shirish Shevade, Prof. Y. Narahari
  • Date and Time: Friday, April 21, 2017, 12:00 PM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
In this thesis, we address several problems concerned with learning from multiple heterogeneous agents. The problems are relevant to several applications today, such as crowdsourcing and sponsored search auctions. In scenarios such as crowdsourcing, the agents are noisy and provide the training data for the learning problems, but their noise levels are unknown. Any learning algorithm that relies on information provided by these agents must learn their qualities while simultaneously learning a model. Another challenge arises when agents are strategic: they could be strategic about the effort they put into performing a task, or about reporting the costs they incur. Learning algorithms suffer if agents provide incorrect information, so it is necessary to ensure that the agents exhibit desirable behaviour. We solve the following problems that arise while learning from such agents.

(1) Multi-label Classification with Heterogeneous Noisy Agents

Multi-label classification is a well-known supervised machine learning problem where each instance is associated with multiple classes. Since several labels can be assigned to a single instance, one of the key challenges in this problem is to learn the correlations between the classes. We first assume labels from a perfect source and propose a novel topic model called Multi-Label Presence-Absence Latent Dirichlet Allocation (ML-PA-LDA). In the current era, a natural way to procure a training dataset is to mine user-generated content or to elicit labels directly from users on a crowdsourcing platform. In this more practical scenario of crowdsourcing, an additional challenge arises, as the labels of the training instances are provided by noisy, heterogeneous crowd-workers with unknown qualities. With this motivation, we further extend our topic model to the scenario where the labels are provided by multiple noisy sources, and refer to this model as ML-PA-LDA-MNS. With experiments on standard datasets, we show that the proposed models achieve superior performance over the state of the art.

(2) Active Linear Regression with Heterogeneous, Noisy and Strategic Agents

We study the problem of training a linear regression model by procuring labels from multiple noisy agents or crowd annotators, under a budget constraint. We propose a Bayesian model for linear regression from multiple noisy sources and use variational inference for parameter estimation. When labels are sought from agents, it is important to minimize the number of labels procured as every call to an agent incurs a cost. Towards this, we adopt an active learning approach. In this specific context, we prove the equivalence of well-studied criteria of active learning like entropy minimization and expected error reduction. For the purpose of annotator selection in active learning, we observe a useful connection with the multi-armed bandit framework. Due to the nature of the distribution of the rewards on the arms, we resort to the Robust Upper Confidence Bound (UCB) scheme with truncated empirical mean estimator to solve the annotator selection problem. This yields provable guarantees on the regret. We further apply our model to the scenario where annotators are strategic and design suitable incentives to induce them to put in their best efforts.
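The truncated empirical mean behind the Robust UCB scheme can be sketched as follows. This is a simplified illustration in the style of the heavy-tailed bandits literature, not the thesis's exact estimator; the thresholds and the bonus constant are indicative assumptions.

```python
import math

def truncated_mean(samples, u, eps, delta):
    """Truncated empirical mean: zero out any sample whose magnitude exceeds
    a threshold that grows with the sample index, curbing heavy-tailed
    outliers. u bounds the (1+eps)-th raw moment; delta is the confidence."""
    n = len(samples)
    total = 0.0
    for s, x in enumerate(samples, start=1):
        threshold = (u * s / math.log(1.0 / delta)) ** (1.0 / (1.0 + eps))
        if abs(x) <= threshold:
            total += x
    return total / n

def robust_ucb_pick(histories, u, eps, delta):
    """Choose the annotator (arm) with the largest robust index:
    truncated mean plus a moment-dependent exploration bonus."""
    def index(samples):
        n = len(samples)
        bonus = 4.0 * u ** (1.0 / (1.0 + eps)) * (math.log(1.0 / delta) / n) ** (eps / (1.0 + eps))
        return truncated_mean(samples, u, eps, delta) + bonus
    return max(histories, key=lambda a: index(histories[a]))
```

Truncation makes the mean estimate robust when rewards lack finite variance, which is the reason for preferring this estimator over the plain empirical mean in the annotator-selection step.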



(3) Ranking with Heterogeneous Strategic Agents

We look at the problem where a planner must rank multiple strategic agents, a problem with many applications including sponsored search auctions (SSA). Stochastic multi-armed bandit (MAB) mechanisms have been used to solve this problem. Existing stochastic MAB mechanisms with a deterministic payment rule, proposed in the literature, necessarily suffer a regret of Ω(T^(2/3)), where T is the number of time steps. This happens because the existing mechanisms consider the worst-case scenario, where the means of the agents' stochastic rewards are separated by a very small amount that depends on T. Our idea is to allow the planner to indicate the resolution, Δ, with which the agents must be distinguished. This immediately leads us to introduce the notion of Δ-regret. We propose a dominant strategy incentive compatible (DSIC) and individually rational (IR) deterministic MAB mechanism based on ideas from the Upper Confidence Bound (UCB) family of MAB algorithms. The proposed mechanism, Δ-UCB, achieves a Δ-regret of O(log T). We first establish the results for single-slot SSA and then non-trivially extend them to the case where multiple slots are to be allocated.


Series: M.Sc. (Engg) Colloquium
Title: An improved lower bound for multi-r-ic depth four circuits as a function of the number of input variables

  • Speaker: Mr. Sumant Hegde
                   M.Sc. (Engg) student at CSA department
  • Faculty Advisor: Dr. Chandan Saha
  • Date and Time: Friday, April 21, 2017, 11:00 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
In this work we study the multi-r-ic formula model introduced by Kayal and Saha (2015) and improve upon the lower bound for multi-r-ic depth four circuits given in the work by Kayal, Saha and Tavenas (2016) [KST16], when viewed as a function of the number of input variables N. The improvement leads to superpolynomial lower bounds for values of r significantly higher than what is known from prior works.

A (syntactically) multi-r-ic formula is an arithmetic formula in which the formal degree with respect to every variable is at most r at every gate. The formal degree of an input gate with respect to a variable x is defined to be 1 if the gate is labelled with x and 0 if it is labelled with a field element or a different variable. The formal degree of a sum (respectively, product) gate with respect to x is defined as the maximum (respectively, sum) of the formal degrees of its children with respect to x. A multi-r-ic formula computes a polynomial with individual degree of every variable bounded by r.
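The recursive definition of formal degree translates directly into code. The tuple encoding of gates below is a hypothetical convenience for illustration, not notation from the talk.

```python
def formal_degree(gate, var):
    """Formal degree of a gate with respect to a variable, following the
    recursive definition above. Gates are encoded as tuples:
    ('var', name), ('const', c), ('+', [children]), ('*', [children])."""
    kind = gate[0]
    if kind == 'var':
        return 1 if gate[1] == var else 0
    if kind == 'const':
        return 0
    degrees = [formal_degree(child, var) for child in gate[1]]
    # sum gate: max of children's degrees; product gate: sum of them
    return max(degrees) if kind == '+' else sum(degrees)

def is_multi_r_ic(gate, variables, r):
    """Check that every gate of the formula has formal degree at most r
    with respect to every variable."""
    if any(formal_degree(gate, v) > r for v in variables):
        return False
    if gate[0] in ('+', '*'):
        return all(is_multi_r_ic(child, variables, r) for child in gate[1])
    return True
```

For example, the formula x · (x + y) has formal degree 2 in x (product gate sums the degrees 1 and 1) and 1 in y, so it is multi-2-ic but not multilinear.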

Multi-r-ic formulas are a natural extension of the relatively well-studied multilinear formulas of Raz (2009) and Raz and Yehudayoff (2009) [RY09]. In this work, we focus on multi-r-ic formulas that compute multilinear polynomials. They are interesting because they allow the formal degree of the formula to be as high as r times the number of underlying variables. This gives extra room for ‘clever’ cancellations of the high-degree components inside the formula, thereby making this type of formula harder to analyse (as formula homogenization is not known to be doable without blowing up the size superpolynomially unless the degree is very small, as Raz (2010) showed). Most lower bound proofs in the literature operate under the restriction of low formal degree or multilinearity (as in the works by Raz (2009), [RY09], Kayal, Saha and Saptharishi (2014) and Kayal, Limaye, Saha and Srinivasan (2014)). In this light, multi-r-ic formulas computing multilinear polynomials form a reasonable intermediate model to study in order to gain some insight on how to deal with high formal degree in general formulas. Another motivation for understanding the high formal degree case better (even at depth three) comes from the depth reduction result of Gupta, Kamat, Kayal and Saptharishi (2013).

With the aim of making progress on multi-r-ic formula lower bounds, [KST16] gave a (N/(d·r^2))^Ω(√(d/r)) lower bound for multi-r-ic depth four formulas computing the N-variate Iterated Matrix Multiplication (IMM) polynomial of degree d. As a function of N, the lower bound is at most 2^Ω(√(N/r^3)), when d = Θ(N/r^2). In this thesis, our focus is on bringing multi-r-ic depth four formulas with larger r into the arena of models that provably admit a superpolynomial lower bound. In [KST16], r can be at most N^(1/3) for the bound to remain superpolynomial. Our result (stated below) gives a superpolynomial lower bound for multi-r-ic depth four formulas where r can be as high as (N·log N)^(0.9).

Theorem: Let N, d, r be positive integers such that 0.51·N ≤ d ≤ 0.9·N and r ≤ (N·log N)^(0.9). Then there is an explicit N-variate degree-d multilinear polynomial in VNP such that any multi-r-ic depth four circuit computing it has size 2^Ω(√(N·(log N)/r)).

The theorem yields a better lower bound than that of [KST16], when viewed as a function of N. Also, the bound matches the best known lower bound (as a function of N) for multilinear (r = 1) depth four circuits, given by [RY09], which is 2^Ω(√(N·log N)).

The improvement is obtained by analysing the shifted partials dimension (SPD) of an N-variate polynomial in VNP (as opposed to a VP polynomial in [KST16]) in the high degree range Θ(N), and comparing it with the SPD of a depth four multi-r-ic circuit. In [KST16], a variant of shifted partials, called shifted skewed partials, is critically used to analyse the IMM polynomial (which is in VP). We observe that SPD (without “skew”) is still effective for the Nisan-Wigderson polynomial (which is in VNP), and yields a better lower bound as a function of N when the degree d is naturally chosen to be high.

Our analysis gives a better range for r and a better lower bound in the high degree regime, not only for depth four multi-r-ic circuits but also for the weaker models: multi-r-ic depth three circuits and multi-r-ic depth four circuits with low bottom support. These (weaker) models are instrumental in gaining insight about general depth four multi-r-ic circuits, both in [KST16] and our work.


Series: Ph.D. Thesis Defense
Title: Scalable Sparse Bayesian Nonparametric and Matrix Tri-factorization Models for Text Mining applications

  • Speaker: Ranganath B. N
  • Faculty Advisor: Prof. Shalabh Bhatnagar
  • Date and Time: Tuesday, April 18, 2017, 2:00 PM
  • Venue: CSA Conference Room (Room No. 225, First Floor)

Abstract
Hierarchical Bayesian models and matrix factorization methods provide an unsupervised way to learn latent components of data from grouped or sequence data. For example, in document data, a latent component corresponds to a topic, with each topic being a distribution over a finite vocabulary of words. For many applications, there exist sparse relationships between the domain entities and the latent components of the data. Traditional approaches to topic modeling do not take these sparsity considerations into account. Modeling these sparse relationships helps in extracting relevant information, leading to improvements in topic accuracy and scalable solutions. In this thesis, we explore these sparsity relationships for different applications, such as text segmentation, topical analysis and entity resolution in dyadic data, through Bayesian and matrix tri-factorization approaches, proposing scalable solutions.

In our first work, we address the problem of segmenting a collection of sequence data, such as documents, using probabilistic models. Existing state-of-the-art hierarchical Bayesian models are connected to the notion of Complete Exchangeability or Markov Exchangeability. Bayesian nonparametric models based on Markov Exchangeability, such as HDP-HMM and Sticky HDP-HMM, allow only very restricted permutations of latent variables in grouped data (topics in documents), which in turn leads to computational challenges for inference. At the other extreme, models based on Complete Exchangeability, such as HDP, allow arbitrary permutations within each group or document, and inference is significantly more tractable as a result, but segmentation is not meaningful using such models. To overcome these problems, we explore a new notion of exchangeability, Block Exchangeability, that lies between Markov Exchangeability and Complete Exchangeability, for which segmentation is meaningful but inference is computationally less expensive than for both. Parametrically, Block Exchangeability requires a sparser set of transition parameters, linear in the number of states compared to the quadratic order for Markov Exchangeability, and still fewer than for Complete Exchangeability, for which the number of parameters is of the order of the number of documents. We propose a nonparametric Block Exchangeable Model (BEM) based on this new notion, which we show to be a superclass of Complete Exchangeability and a subclass of Markov Exchangeability. We propose a scalable inference algorithm for BEM that infers the topics for words, and the segment boundaries associated with topics in a document, using a collapsed Gibbs sampling procedure. Empirical results show that BEM outperforms state-of-the-art nonparametric models in terms of scalability and generalization ability, with nearly the same segmentation quality, on a news dataset, a product review dataset and a synthetic dataset. Interestingly, we can tune the scalability by varying the block size through a parameter in our model, for a small trade-off in segmentation quality.

In addition to exploring the association between documents and words, we also explore the sparse relationships for dyadic data, where associations between one pair of domain entities such as (documents, words) and associations between another pair such as (documents, users) are completely observed. We motivate the analysis of such dyadic data introducing an additional discrete dimension, which we call topics, and explore sparse relationships between the domain entities and the topic, such as of user-topic and document-topic respectively.

In our second work, for this problem of sparse topical analysis of dyadic data, we propose a formulation using sparse matrix tri-factorization. This formulation requires sparsity constraints, not only on the individual factor matrices, but also on the product of two of the factors. To the best of our knowledge, this problem of sparse matrix tri-factorization has not been studied before. We propose a solution that introduces a surrogate for the product of factors and enforces sparsity on this surrogate as well as on the individual factors through L1-regularization. The resulting optimization problem is efficiently solvable in an alternating minimization framework over sub-problems involving individual factors using the well known FISTA algorithm. For the sub-problems that are constrained, we use a projected variant of the FISTA algorithm.
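As an illustration of the kind of L1-regularized sub-problem solved in such an alternating scheme, here is a minimal FISTA sketch for a lasso-type objective. The tri-factorization sub-problems themselves are more involved; the names and iteration count here are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista_lasso(A, b, lam, iters=200):
    """FISTA for min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    z = x.copy()
    t = 1.0
    for _ in range(iters):
        grad = A.T @ (A @ z - b)
        x_next = soft_threshold(z - grad / L, lam / L)     # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum extrapolation
        x, t = x_next, t_next
    return x
```

The projected variant mentioned above would replace the soft-thresholding step with a projection onto the constraint set.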

We also show that our formulation leads to independent sub-problems towards solving a factor matrix, thereby supporting parallel implementation leading to a scalable solution. We perform experiments over bibliographic and product review data to show that the proposed framework based on sparse tri-factorization formulation results in better generalization ability and factorization accuracy compared to baselines that use sparse bi-factorization.

Although the second work performs sparse topical analysis of dyadic data and finds sparse topical associations for users, user references with different names may belong to the same entity, and references with the same name may belong to different entities. The problem of entity resolution, widely studied in the research community, is to identify the real users associated with the user references in the documents.

Finally, we focus on the problem of entity resolution in dyadic data, where associations between one pair of domain entities such as documents-words and associations between another pair such as documents-users are observed, an example of which includes bibliographic data.

In our final work, for this problem of entity resolution in bibliographic data, we propose a Bayesian nonparametric Sparse Entity Resolution Model (SERM), exploring the sparse relationships between the grouped data, involving grouping of the documents, and the topics/author entities in the group. Further, we also exploit the sparseness between an author entity and the associated author aliases. Grouping of the documents is achieved with a stick-breaking prior for the Dirichlet process (DP). To achieve sparseness, we propose a solution that introduces separate Indian Buffet Process (IBP) priors over topics and author entities for the groups, and a k-NN mechanism for selecting author aliases for the author entities.

We propose a scalable inference scheme for SERM by appropriately combining the partially collapsed Gibbs sampling scheme of the focused topic model (FTM), the inference scheme used for the parametric IBP prior, and the k-NN mechanism. We perform experiments on the Citeseer and Rexa bibliographic datasets to show that, compared to the state-of-the-art baseline, the proposed SERM model improves the accuracy of entity resolution by finding relevant author entities through modeling sparse relationships, and is scalable.


Series: Department Seminar
Title: Identifying groups of strongly correlated variables through Smoothed Ordered Weighted l1-norms

  • Speaker: Mr. Raman Sankaran
                   PhD student, CSA
  • Faculty Advisor: Prof. Chiranjib Bhattacharyya
  • Date and Time: Thursday, April 13, 2017, 3:00 PM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
The failure of LASSO to identify groups of correlated predictors in linear regression has sparked significant research interest. Recently, various norms [1, 2] were proposed, which can be best described as instances of ordered weighted l1 norms (OWL) [3], as an alternative to the l1 regularization used in LASSO. OWL can identify groups of correlated variables, but it forces the model to be constant within a group. This artifact induces unnecessary bias in the model estimation. In this paper we take a submodular perspective and show that OWL can be posed as the Lovász extension of a suitably defined submodular function. The submodular perspective not only explains the group-wise constant behavior of OWL, but also suggests alternatives. The main contribution of this paper is smoothed OWL (SOWL), a new family of norms, which not only identifies the groups but also allows the model to be flexible inside a group. We establish several algorithmic and theoretical properties of SOWL, including group identification and model consistency. We also provide algorithmic tools to compute the SOWL norm and its proximal operator, whose computational complexity (O(d log d)) is significantly better than that of general purpose solvers (O(d^2 log d)). In our experiments, SOWL compares favorably with OWL in the regimes of interest.
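For intuition, evaluating an OWL norm only requires sorting the magnitudes and taking an inner product with the non-increasing weight vector, which is where an O(d log d) cost arises; this sketch assumes that standard definition of OWL.

```python
import numpy as np

def owl_norm(x, w):
    """Ordered weighted l1 norm: inner product of the non-increasing weight
    vector w with the magnitudes of x sorted in decreasing order. The sort
    dominates the cost, giving O(d log d)."""
    return float(np.sort(np.abs(x))[::-1] @ np.asarray(w, dtype=float))
```

With all weights equal this reduces to the l1 norm, and with w = (1, 0, ..., 0) it reduces to the l-infinity norm, which is one way to see that OWL interpolates between familiar regularizers.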

Speaker Bio:
Raman Sankaran is a PhD student in the Department of CSA. This is a practice talk for AISTATS.


Series: M.Sc. (Engg) Colloquium
Title: Supervised Classification of Missense Mutations as Pathogenic or Tolerated using Ensemble Learning Methods.

  • Speaker: Ms. Rashmi Balasubramanyam
                   M.Sc. Student
  • Faculty Advisor: Prof. Chiranjib Bhattacharyya
  • Date and Time: Thursday, April 06, 2017, 9:30 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
Missense mutations account for more than 50% of the mutations known to be involved in human inherited diseases. Missense classification is a challenging task that involves sequencing of the genome, identifying the variations, and assessing their deleteriousness. This is a very laborious, time and cost intensive task to be carried out in the laboratory. Advancements in bioinformatics have led to several large-scale next-generation genome sequencing projects, and subsequently the identification of genome variations. Several studies have combined this data with information on established deleterious and neutral variants to develop machine learning based classifiers.

There are significant issues with existing missense classifiers, due to which missense classification is still an open area of research. These issues fall under two broad categories. (a) Dataset overlap: the performance estimates reported by state-of-the-art classifiers are overly optimistic, as the classifiers have often been evaluated on datasets that overlap significantly with their training datasets. Moreover, there is no comparative analysis of these tools on a common benchmark dataset that has no overlap with the training datasets, making it impossible to identify the best classifier among them; indeed, no such benchmark dataset has been available. (b) Inadequate capture of vital biological information about the protein and the mutation, such as conservation of long-range amino acid dependencies and changes in certain physico-chemical properties between the wild-type and mutant amino acids. It is also not clear how to extract and use this information, and some classifiers rely on structural information that is not available for all proteins.

In this study, we compiled a new dataset with 18,036 mutations in 5,642 proteins, containing only around 2-15% overlap with the popularly used training datasets. We reviewed and evaluated 15 state-of-the-art missense classifiers - SIFT, PANTHER, PROVEAN, PhD-SNP, Mutation Assessor, FATHMM, SNPs&GO, SNPs&GO-3D, nsSNPAnalyzer, PolyPhen-2, SNAP, MutPred, PON-P2, CONDEL and MetaSNP - using six metrics: accuracy, sensitivity, specificity, precision, NPV and MCC. When evaluated on our dataset, these classifiers show huge performance drops from what has been claimed: on average, around 15% in accuracy, 17% in sensitivity, 14% in specificity, 7% in NPV, 24% in precision and 30% in MCC. This shows that the performance of these tools is not consistent across datasets, and thus not reliable for practical use in a clinical setting.

As the performance of the existing classifiers is poor in general, we sought to develop a new classifier that is robust, performs consistently across datasets, and improves on the state of the art. We developed a novel method of capturing the conservation of long-range amino acid dependencies by boosting the conservation frequencies of substrings of amino acids of various lengths around the mutation position, using the AdaBoost learning algorithm. This score alone performed on par with the sequence-conservation-based tools in classifying missense mutations. We then combined popularly used sequence conservation properties with the boosted long-range dependency conservation scores, again using AdaBoost. This reduced the class bias and improved the overall accuracy of the classifier. We trained a third classifier by additionally incorporating changes in 21 important physico-chemical properties due to the mutation; the overall performance further improved and the class bias further reduced. The performance of our final classifier is comparable with that of the state-of-the-art classifiers. We did not find a significant overall improvement, but the class-specific accuracies and precisions are marginally better, by around 1-2%, than those of the existing classifiers. To understand our classifier better, we dissected our benchmark dataset into (a) seen and unseen proteins, and (b) pure and mixed proteins, and analyzed the performance in detail. We conclude that our classifier performs consistently across each of these categories of seen, unseen, pure and mixed proteins.
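To illustrate the boosting step itself, here is a minimal AdaBoost over decision stumps on toy data. This is a generic sketch of the algorithm, not the talk's classifier or features; all names and values are illustrative.

```python
import numpy as np

def adaboost_stumps(X, y, rounds=20):
    """Minimal AdaBoost over decision stumps; labels y are in {-1, +1}.
    Each round picks the stump with the lowest weighted error, then
    re-weights the examples to focus on the remaining mistakes."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []                          # (feature, threshold, polarity, alpha)
    for _ in range(rounds):
        best = None
        for j in range(d):                 # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w = w * np.exp(-alpha * y * pred)  # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(alpha * np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                for j, thr, pol, alpha in ensemble)
    return np.where(score > 0, 1, -1)
```

In the described pipeline, the weak learners would operate on conservation-frequency and physico-chemical features rather than raw columns, but the weighting scheme is the same.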


Series: Ph.D. Thesis Defense
Title: Program Repair by Automated Generation of Hints

  • Speaker: Ms. Shalini Kaleeswaran
                   Ph.D student
  • Faculty Advisor: Prof. Aditya Kanade
  • Date and Time: Wednesday, April 05, 2017, 12:00 PM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
Programming has become an important skill in today's technology-driven world. It is a complex activity, because of which programmers make mistakes in their software. Student programmers make mistakes in their programs due to lack of programming experience and also due to lack of understanding of the algorithmic concepts being used. An experienced programmer in industry, on the other hand, makes mistakes because of the sheer complexity of industry-grade software. Debugging one's programs, i.e., finding and fixing faults, is one of the most challenging activities for a programmer. This motivates the need for automated techniques that are effective in assisting programmers in repairing faults in their programs.

In this thesis, we present new computer-assisted approaches to help programmers repair their programs. Finding repair for a fault in the program is essentially a search over the space of all possible repairs. Traditional repair techniques output a complete repaired program that eliminates the fault in the original program. This ambitious goal of finding a complete repair is often unachievable because the search space becomes too large to navigate efficiently. The key insight of our work is that programmers only need guidelines that help them understand the faults they made and suggestions on how to fix them, as opposed to a complete repair. Therefore, this thesis proposes a novel perspective of solving the program repair problem, where we generate textual hints that direct the programmer on how to fix the fault. A hint specifies which part of the program needs modification as well as how to modify it. For student programmers, our hints help them reflect on the mistakes they made and improve their understanding of the problem being solved in the program. For experienced programmers, our hints help them easily identify the cause of fault and guide them to arrive at a repair. Our approaches use novel combinations of syntactic, symbolic and statistical reasoning techniques to achieve this goal of semi-automated program repair.

Our first contribution is an algorithmic technique for automated synthesis of repair hints. Our technique takes as input a faulty program and a set of test cases, and outputs a ranked list of repair hints that suggest how to rectify a faulty statement in the program. In this technique, we leverage symbolic execution and statistical correlation analysis to identify expressions that are likely to occur in the repaired code. We then use syntactic pattern matching to combine these expressions algorithmically and generate repair hints. Since we restrict ourselves to finding "building blocks" of repair, our hints are very effective in guiding the programmer to the final repair, even in scenarios where a complete repair cannot be obtained. We implemented this technique in a tool called "MintHint" that performs automated synthesis of repair hints for C programs. We evaluated the effectiveness of MintHint through a user study and found that programmers who used MintHint were able to repair more faults than those who didn't, and obtained the repair 5.8 times faster.

Our next contribution is a semi-supervised feedback generation methodology for student programming assignments. In this methodology, we take a set of student submissions as input and generate personalized, verified feedback for each submission that suggests modifications to fix faults in the program, if any. Student submissions are first clustered by solution strategy, which is followed by an automated feedback generation phase. We use unsupervised clustering to reduce the burden on the instructor of providing reference implementations for each possible solution strategy: the instructor is called upon simply to identify a correct submission in each cluster, which acts as the reference implementation for all submissions in the cluster. In the next phase, we use a novel counter-example guided feedback generation algorithm that compares a student submission with the reference implementation to generate verified feedback. We applied this methodology to programs implementing iterative dynamic programming algorithms and developed a tool called "DPAssist" that works on C programs. We evaluated DPAssist on a dataset of 2226 student submissions to 4 programming problems from the programming contest site CodeChef and successfully generated verified feedback for 85% of them.

Video Archive Go Top

 

Series: M.Sc. (Engg) Thesis Defense
Title: Learning Tournament Solutions from Preference-based Multi-Armed Bandits

  • Speaker: Mr. Siddartha Ramamohan
                   M.Sc (Engg.) student at CSA department
  • Faculty Advisor: Prof. Chiranjib Bhattacharyya
  • Date and Time: Thursday, March 30, 2017, 10:00 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
We consider the dueling bandits problem, a sequential decision task where the goal is to learn to pick ‘good’ arms out of an available pool by actively querying for, and observing, relative preferences between selected pairs of arms. The noisy observed preferences are assumed to be generated by a fixed but unknown stochastic preference model. Motivated by applications in information retrieval, e-commerce, crowd-sourcing, etc., a number of bandit algorithms have been proposed for this task in recent years. These have mostly addressed restricted settings wherein the underlying preference model satisfies various structural assumptions, such as being based on a random utility function or a feature-space embedding, satisfying transitivity or sparsity properties, or at least possessing a Condorcet winner – a single ‘best’ arm that is preferred to all others. In seeking to move beyond such restricted settings, there has been a recent shift towards alternative notions of ‘good’ arms (including Borda, Copeland and von Neumann winners).

In this work, we extend the dueling bandits problem by adopting, as the desired target set of good (or ’winning’) arms, a number of tournament solutions that have been proposed in the social choice and voting theory literature as embodying principled and natural criteria for identifying good arms based on preference relations. We then propose a family of upper confidence bound (UCB) based dueling bandit algorithms that learn to play winning arms from several popular tournament solutions: the top cycle, uncovered set, Banks set and Copeland set.

We derive these algorithms by first proposing a generic UCB-based framework algorithm that can be instantiated for different tournament solutions by means of appropriately designed ‘selection procedures’. We show sufficiency conditions for the resulting dueling bandit algorithms to satisfy distribution-dependent, horizon-free bounds on natural regret measures defined w.r.t. the target tournament solutions. In contrast to previous work, these bounds do not require restrictive structural assumptions on the preference model and hold for a range of different tournament solutions.

We develop selection procedures that satisfy the sufficiency conditions for several popular tournament solutions, yielding the dueling bandit algorithms UCB-TC, UCB-UC, UCB-BA and UCB-CO for the top cycle, uncovered set, Banks set and Copeland set, respectively. The O((K^2 ln T)/g^2) bounds we derive are optimal in their dependence on the time horizon T, and we show that the distribution-dependent ‘margin’ g is lower-bounded by the separation or the relative advantage of top-cycle arms over non-top-cycle arms.
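As a concrete illustration of the optimism-in-the-face-of-uncertainty principle underlying these algorithms, the following sketch maintains upper confidence bounds on pairwise preference probabilities and repeatedly duels an optimistic champion against its strongest optimistic rival. The selection procedure shown is a simplified stand-in, not the thesis's UCB-TC/UCB-UC/UCB-BA/UCB-CO procedures, and `prefs` is a hypothetical simulated preference matrix used only to generate feedback.

```python
import math
import random

def ucb_duel(prefs, horizon, alpha=0.51):
    """Generic UCB-style dueling bandit sketch (illustrative only).

    prefs[i][j] = true probability that arm i beats arm j; it is
    unknown to the learner and used here only to simulate feedback.
    """
    k = len(prefs)
    wins = [[0] * k for _ in range(k)]   # wins[i][j]: times i beat j
    for t in range(1, horizon + 1):
        def u(i, j):
            # optimistic estimate of P(i beats j)
            n = wins[i][j] + wins[j][i]
            if n == 0:
                return 1.0
            return wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)
        # champion: arm whose worst optimistic duel looks best
        i = max(range(k),
                key=lambda a: min(u(a, b) for b in range(k) if b != a))
        # challenger: rival with the largest optimistic chance against i
        j = max((b for b in range(k) if b != i), key=lambda b: u(b, i))
        if random.random() < prefs[i][j]:
            wins[i][j] += 1
        else:
            wins[j][i] += 1
    return wins
```

With a Condorcet winner present, the champion concentrates on it; under more general preference models, the choice of selection procedure (the thesis's contribution) determines which tournament solution is targeted.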

We empirically validate these claims and evaluate the proposed algorithms, comparing them to the dueling bandit algorithms RUCB, SAVAGE and BTMB over synthetic and real-world preference models. We show that the UCB-TS algorithms perform competitively over models that possess a Condorcet winner, and outperform the other algorithms over more general models that do not possess a Condorcet winner.

Video Archive Go Top

 

Series: Ph.D. Thesis Defense
Title: New Techniques for Automatic Short Answer Grading

  • Speaker: Mr. Shourya Roy
                    Xerox Research Centre India
                    Bangalore
  • Faculty Advisor: Prof. Y. Narahari (IISc) Dr. Manish Gupta (Xerox Research Centre India)
  • Date and Time: Tuesday, March 28, 2017, 3:00 PM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
Assessing the knowledge acquired by learners is a key aspect of the pedagogical ecosystem. Conducting periodic assessment exercises by manual means is clearly monotonous, non-rewarding, and moreover, infeasible for large student populations such as those in massive open online courses. The scope of computer-assisted assessment has been predominantly constrained to "recognition"-type questions (for example, multiple choice questions). Research on automatic grading of students' constructed responses is still in its infancy, and this thesis represents an attempt to advance the state of the art in this area. In particular, the thesis focuses on automatic short answer grading (ASAG), which involves automatically grading short answers to questions in natural language, with lengths ranging from a few words to a few sentences. We propose new techniques and novel algorithms that could become the core of ASAG systems.

Unsupervised Approach

We first explore approaches based on textual similarity measures. We advance the current art by applying word-embedding models and showing that they provide a strong baseline. Second, we focus on the vexed problem of the unreliable performance of unsupervised ASAG techniques, caused by their sole reliance on instructor-provided model answers. We make an intriguing observation, the "wisdom of students": students' answers to a question contain significant lexical commonalities, and more often than not, these commonalities are related to the correct answer to the question. Harnessing classical sequential pattern mining techniques, we propose a "fluctuation smoothing" approach that not only improves the performance of ASAG techniques but also makes them more reliable, irrespective of how the model answers are articulated.
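The "wisdom of students" observation can be illustrated with a deliberately simplified token-overlap grader: tokens that recur across many student answers are folded into the reference vocabulary before computing similarity scores. This is only a toy sketch with made-up inputs; the thesis itself uses sequential pattern mining and a fluctuation smoothing approach, not this frequency heuristic.

```python
from collections import Counter

def asag_scores(model_answer, student_answers, top_k=10):
    """Toy similarity grader illustrating the "wisdom of students" idea:
    tokens common across student answers augment the model answer's
    vocabulary before Jaccard-style overlap scoring. (Illustrative
    sketch only, not the thesis's pattern-mining algorithm.)
    """
    tokenize = lambda s: s.lower().split()
    model = set(tokenize(model_answer))
    # document frequency: in how many distinct answers each token appears
    df = Counter(tok for ans in student_answers for tok in set(tokenize(ans)))
    common = {tok for tok, _ in df.most_common(top_k)}
    reference = model | common          # augmented reference vocabulary
    scores = []
    for ans in student_answers:
        toks = set(tokenize(ans))
        scores.append(len(toks & reference) / max(len(toks | reference), 1))
    return scores
```

Even this crude augmentation tends to reward paraphrases that the class converges on, which is the intuition the thesis develops rigorously.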

Supervised Approach

Most prior work on supervised ASAG systems has considered ground-truth scores of nominal or continuous type. We show that an ordinal regression formulation with discrete ratio scores yields superior performance compared to classification or regression. Moreover, we demonstrate that considering students' responses in an ensemble along with model-answer-based features helps handle certain types of questions much better.

Transfer Learning Based Approach

While supervised techniques provide superior performance, they need ongoing instructor involvement to create labeled data in the form of graded answer scripts. This undermines the benefit obtained by automating the grading process. We propose an iterative transfer-learning-based approach for ASAG that treats different questions as different domains. Building on the supervised ensemble framework, we employ a canonical correlation analysis based common feature representation to transfer model-answer-based features from a source question to a target question, and iteratively build the labeled data pool for the classifier based on students' responses.

Selecting an Optimal Technique

We highlight that the performance of ASAG techniques depends on the nature and difficulty level of questions as well as on the diversity of students' answers. This is compounded by the fact that evaluation measures for ASAG techniques have remained fairly non-standard, and different measures favor different ASAG techniques. Building on this observation, we propose a rigorous method to automatically learn a mapping from questions to ASAG techniques using minimal human (expert/crowd) feedback. We do this by formulating the learning task as a contextual bandits problem and providing a regret minimization algorithm that handles key practical considerations such as noisy experts and similarity between questions.

Video Archive Go Top

 

Series: M.Sc. (Engg) Colloquium
Title: Supervised Classification of Missense Mutations as Pathogenic or Tolerated using Ensemble Learning Methods

  • Speaker: Ms. Rashmi Balasubramanyam
                   M.Sc (Engg.) student at CSA department
  • Faculty Advisor: Prof. Chiranjib Bhattacharyya
  • Date and Time: Tuesday, March 28, 2017, 10:00 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
Missense mutations account for more than 50% of the mutations known to be involved in human inherited diseases. Missense classification is a challenging task that involves sequencing the genome, identifying the variations, and assessing their deleteriousness. This is a laborious, time- and cost-intensive task to carry out in the laboratory. Advancements in bioinformatics have led to several large-scale next-generation genome sequencing projects, and subsequently to the identification of genome variations. Several studies have combined this data with information on established deleterious and neutral variants to develop machine-learning-based classifiers.

There are significant issues with existing missense classifiers, due to which missense classification is still an open area of research. These issues fall under two broad categories. (a) Dataset overlap: the performance estimates reported by the state-of-the-art classifiers are overly optimistic, as they have often been evaluated on datasets that overlap significantly with their training datasets. Moreover, there is no comparative analysis of these tools on a common benchmark dataset that has no overlap with the training datasets, making it impossible to identify the best classifier among them; indeed, no such common benchmark dataset is available. (b) Inadequate capture of vital biological information about the protein and the mutations, such as conservation of long-range amino acid dependencies, and changes in certain physico-chemical properties of the wild-type and mutant amino acids due to the mutation. It is also not clear how to extract and use this information. In addition, some classifiers use structural information that is not available for all proteins.

In this study, we compiled a new dataset with 18,036 mutations in 5,642 proteins, containing only around 2-15% overlap with the popularly used training datasets. We reviewed and evaluated 15 state-of-the-art missense classifiers - SIFT, PANTHER, PROVEAN, PhD-SNP, Mutation Assessor, FATHMM, SNPs&GO, SNPs&GO-3D, nsSNPAnalyzer, PolyPhen-2, SNAP, MutPred, PON-P2, CONDEL and MetaSNP - using six metrics: accuracy, sensitivity, specificity, precision, NPV and MCC. When evaluated on our dataset, these classifiers show large performance drops from what has been claimed: on average, around 15% in accuracy, 17% in sensitivity, 14% in specificity, 7% in NPV, 24% in precision and 30% in MCC. This shows that the performance of these tools is not consistent across datasets, and thus not reliable for practical use in a clinical setting.

Since the performance of the existing classifiers is poor in general, we set out to develop a new classifier that is robust, performs consistently across datasets, and improves on the state of the art. We developed a novel method of capturing the conservation of long-range amino acid dependencies by boosting the conservation frequencies of substrings of amino acids of various lengths around the mutation position using the AdaBoost learning algorithm. This score alone performed on par with the sequence-conservation-based tools in classifying missense mutations. We then combined popularly used sequence conservation properties with these boosted long-range dependency conservation scores using the AdaBoost algorithm. This reduced the class bias and improved the overall accuracy of the classifier. We trained a third classifier by additionally incorporating the changes in 21 important physico-chemical properties due to the mutation. In this case, the overall performance further improved and the class bias further reduced. The performance of our final classifier is comparable with that of the state-of-the-art classifiers: we did not find any significant overall improvement, but the class-specific accuracies and precisions are marginally better, by around 1-2%, than those of the existing classifiers. To understand our classifier better, we dissected our benchmark dataset into (a) seen and unseen proteins, and (b) pure and mixed proteins, and analyzed the performance in detail. We conclude that our classifier performs consistently across each of these categories of seen, unseen, pure and mixed proteins.

Video Archive Go Top

 

Series: M.Sc. (Engg) Colloquium
Title: Computing Contour Trees for 2D Piecewise Polynomial Functions

  • Speaker: Girijanandan Nucha
                   MSc (Engg) student
  • Faculty Advisor: Prof. Vijay Natarajan
  • Date and Time: Monday, March 27, 2017, 11:00 AM
  • Venue: CSA Multimedia Class (Room No. 252, First Floor)

Abstract
Contour trees are extensively used in scalar field analysis. The contour tree is a data structure that tracks the evolution of level set topology in a scalar field. Scalar fields are typically available as samples at vertices of a mesh and are linearly interpolated within each cell of the mesh. A more suitable way of representing scalar fields, especially when a smoother function needs to be modeled, is via higher order interpolants. We propose an algorithm to compute the contour tree for such functions. The algorithm computes a local structure by connecting critical points using a numerically stable monotone path tracing procedure. Such structures are computed for each cell and are stitched together to obtain the contour tree of the function. The algorithm is scalable to higher degree interpolants, whereas previous methods were restricted to quadratic or linear interpolants. The algorithm is intrinsically parallelizable and has potential applications to isosurface extraction.
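For background, the classical union-find sweep that computes the join tree (one half of the contour tree) of a piecewise-linear field on a mesh graph can be sketched as follows. This is standard prior art for the linear-interpolant setting; the talk's contribution replaces the per-cell linear assumption with monotone path tracing for higher-order interpolants, which this sketch does not attempt.

```python
def join_tree(values, edges):
    """Union-find sweep computing the (augmented) join tree of a
    scalar field sampled at graph vertices, assuming piecewise-linear
    interpolation. values: scalar per vertex; edges: (u, v) pairs.
    Returns arcs (upper_end, v): the open end of a superlevel-set
    component is connected to vertex v where it grows or merges.
    """
    n = len(values)
    parent = list(range(n))
    lowest = list(range(n))        # open (lowest) end of each component's arc
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    order = sorted(range(n), key=lambda v: -values[v])   # sweep top-down
    rank = {v: i for i, v in enumerate(order)}
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v); adj[v].append(u)
    arcs = []
    for v in order:
        for w in adj[v]:
            if rank[w] < rank[v]:          # w already swept (higher value)
                rw, rv = find(w), find(v)
                if rw != rv:               # w's component reaches down to v
                    arcs.append((lowest[rw], v))
                    parent[rw] = rv
                    lowest[rv] = v
    return arcs
```

The split tree is obtained by sweeping in the opposite direction, and the two are merged to form the contour tree; the stitching of per-cell structures in the proposed algorithm plays an analogous combining role.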

Video Archive Go Top

 

Series: Ph.D. Thesis Defense
Title: Resolving the Complexity of Some Fundamental Problems in Computational Social Choice

  • Speaker: Mr. Palash Dey
                   Ph.D student
  • Faculty Advisor: Prof. Y. Narahari and Dr. Arnab Bhattacharyya
  • Date and Time: Friday, March 24, 2017, 9:00 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
In many real-world situations, especially those involving multiagent systems and artificial intelligence, participating agents often need to agree upon a common alternative even if they have differing preferences over the available alternatives. Voting is one of the tools of choice in these situations. Classic applications of voting in modern systems include collaborative filtering and recommender systems, metasearch engines, and coordination and planning among multiple automated agents. Agents in these applications usually have computational power at their disposal. This makes the study of computational aspects of voting crucial. This thesis is devoted to a study of the computational complexity of several fundamental problems arising in the context of voting theory.

The typical setting for our work is an "election": an election consists of a set of voters or agents, a set of alternatives, and a voting rule. The vote of any agent can be thought of as a ranking (more precisely, a complete order) of the set of alternatives. A voting profile comprises the collection of votes of all the agents. Finally, a voting rule is a mapping that takes a voting profile as input and outputs an alternative, which is called the "winner" or "outcome" of the election. Our work can be categorized into three parts, as described below.

Part I: Preference Elicitation

In the first part of the thesis, we study the problem of eliciting the preferences of a set of voters by asking a small number of comparison queries (such as which of two given alternatives a voter prefers) for various interesting domains of preferences.

We begin by considering the domain of single-peaked preferences on trees. We show tight dependencies between the query complexity of preference elicitation and various parameters of the single-peaked tree, such as the number of leaves, the diameter, the path width, and the maximum degree of a node. We next consider preference elicitation for the domain of single-crossing preference profiles. We establish that the query complexity of preference elicitation in this domain crucially depends on how the votes are accessed and on whether or not a single-crossing ordering is known a priori.

Part II: Winner Determination in Elections

In the second part of the thesis, we undertake a study of computational complexity of several important problems related to determining the winner of an election. We begin with a study of the following problem: Given an election, predict the winner of the election under some fixed voting rule by sampling as few votes as possible. We establish optimal or almost optimal bounds on the number of votes one needs to sample for many commonly used voting rules. We next study efficient sampling based algorithms for estimating the margin of victory of a given election for many common voting rules.
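A minimal sketch of the sampling idea for the plurality rule: estimate the winner from a uniform random sample of ballots. Determining how large `sample_size` must be for a given confidence, and for rules beyond plurality, is the technical content of the thesis; this snippet only illustrates the setup on a hypothetical profile.

```python
import random
from collections import Counter

def sample_plurality_winner(votes, sample_size, seed=None):
    """Predict the plurality winner from a uniform random sample of
    ballots. votes: list of rankings (tuples of alternatives, best
    first); only the top choice matters for plurality.
    """
    rng = random.Random(seed)
    sample = rng.sample(votes, min(sample_size, len(votes)))
    counts = Counter(ranking[0] for ranking in sample)
    return counts.most_common(1)[0][0]
```

With a clear margin between the top two alternatives, standard concentration bounds make even small samples reliable, which is why the achievable sample sizes depend on the margin of victory.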

Following this, we design an optimal algorithm for determining the plurality winner of an election when the votes arrive one by one in a streaming fashion. Our work resolves an intriguing question on finding heavy hitters that had remained open in the data stream literature for more than 35 years. We also provide near-optimal algorithms for determining the winner of a stream of votes for other popular voting rules, such as veto, Borda and maximin.
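The connection to heavy hitters can be illustrated with the classical Misra–Gries summary, which in one pass with k−1 counters retains every item occurring more than a 1/k fraction of the time. This is standard background only, not the thesis's optimal streaming algorithm.

```python
def misra_gries(stream, k):
    """Misra–Gries heavy-hitters summary: one pass, at most k-1
    counters. Any item occurring more than len(stream)/k times is
    guaranteed to survive in the returned dict (with an undercount).
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement all counters; drop any that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For plurality over a vote stream, applying such a summary to the top choices yields a small candidate set from which the winner can be verified, for instance in a second pass.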

Voters' preferences are often partial orders instead of complete orders. This is known as the incomplete information setting in computational social choice theory. We study the kernelization complexity (under the complexity-theoretic framework of parameterized complexity) of the possible winner problem and show that, for commonly used voting rules, this problem does not admit kernels of size polynomial in the number of alternatives. However, we show that the coalitional manipulation problem, an important special case of the possible winner problem, does admit a kernel whose size is polynomially bounded in the number of alternatives for common voting rules.

Part III: Election Control

In the final part of the thesis, we study the computational complexity of various interesting aspects of strategic behavior in voting.

First, we consider the impact of partial information in the context of strategic manipulation. We show that lack of complete information makes the computational problem of manipulation intractable for many commonly used voting rules. Next, we initiate the study of the computational problem of detecting possible instances of election manipulation. We show that detecting manipulation may be computationally easy under certain scenarios even when manipulation is intractable.

The computational problem of bribery is extensively studied in computational social choice theory. We study the computational complexity of bribery when the briber is frugal in nature. We show, for many common voting rules, that the bribery problem remains intractable even when the briber's behavior is restricted to be frugal, thereby strengthening intractability results from the literature.

Video Archive Go Top

 

Series: Department Seminar
Title: Learning from Untrusted Data

  • Speaker: Prof. Moses Charikar,
                   Computer Science Stanford University USA.
  • Date and Time: Friday, March 24, 2017, 3:00 PM
  • Venue: CSA Lecture Hall (Room No. 117, Ground Floor)

Abstract
The vast majority of theoretical results in machine learning and statistics assume that the available training data is a reasonably reliable reflection of the phenomena to be learned or estimated. Similarly, the majority of machine learning and statistical techniques used in practice are brittle to the presence of large amounts of biased or malicious data. In this work we propose two novel frameworks in which to study estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, which we term list-decodable learning, asks whether it is possible to return a list of answers, with the guarantee that at least one of them is accurate. For example, given a dataset of n points for which an unknown subset of αn points are drawn from a distribution of interest, and no assumptions are made about the remaining (1−α)n points, is it possible to return a list of poly(1/α) answers, one of which is correct? The second framework, which we term the semi-verified learning model, considers the extent to which a small dataset of trusted data (drawn from the distribution in question) can be leveraged to enable the accurate extraction of information from a much larger but untrusted dataset (of which only an α-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. This general result has immediate implications for robust estimation in a number of settings, including robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been perturbed by an adversary.

Speaker Bio:
Moses Charikar is a professor of Computer Science at Stanford University. He obtained his PhD from Stanford in 2000, spent a year in the research group at Google, and was on the faculty at Princeton from 2001 to 2014. He is broadly interested in the design and analysis of algorithms, with an emphasis on approximation algorithms for hard problems, metric embeddings, and algorithmic techniques for big data. His work on dimension reduction won the best paper award at FOCS 2003. He was awarded the 2012 Paris Kanellakis Theory and Practice Award for his work on locality sensitive hashing, and was named a Simons Investigator in theoretical computer science in 2014.

Host Faculty: Dr. Anand Louis

Video Archive Go Top

 

Series: Ph.D. Thesis Defense
Title: "Design of Quality Assuring Mechanisms with Learning, for Strategic Crowds"

  • Speaker: Satyanath Bhat
                   Ph.D student
  • Faculty Advisor: Prof. Y Narahari
  • Date and Time: Wednesday, March 22, 2017, 11:30 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
In this thesis, we address several generic problems concerned with the procurement of tasks from a crowd consisting of strategic workers whose qualities are uncertain. These problems assume importance because the quality of services in a service marketplace is known to degrade when there is (unchecked) information asymmetry pertaining to quality. Moreover, crowdsourcing is increasingly being used for a wide variety of tasks, since it offers high levels of flexibility to workers as well as employers. We seek to address the issue of quality uncertainty in crowdsourcing through mechanism design and machine learning. Since the interactions on web-based crowdsourcing platforms are logged, the captured data can be used to learn unknown parameters such as the qualities of individual crowd workers. Further, many of these platforms invite bids from crowd workers for available tasks, but "strategic" workers may not bid truthfully. This warrants the use of mechanism design to induce truthful bidding. There ensues a complex interplay between machine learning and mechanism design, leading to interesting technical challenges. We resolve some generic challenges in the context of the following problems.

(1) Design of a quality eliciting mechanism with interdependent values

We consider an expert sourcing problem, where a planner seeks opinions from a pool of experts. Execution of the task at an assured quality level in a cost effective manner turns out to be a mechanism design problem when the individual qualities are private information of the experts. Also, the task execution problem involves interdependent values, where truthfulness and efficiency cannot be achieved in an unrestricted setting due to an impossibility result. We propose a novel mechanism that exploits the special structure of the problem and guarantees allocative efficiency, ex-post incentive compatibility and strict budget balance for the mechanism, and ex-post individual rationality for the experts.

(2) Design of an optimal bi-dimensional crowdsourcing auction

We study the problem faced by an auctioneer who gains stochastic rewards by procuring multiple units of a service from a pool of heterogeneous strategic workers. The reward obtained depends on the inherent quality of the worker; the worker's quality is fixed but unknown. The costs and capacities are private information of the workers. The auctioneer is required to elicit costs and capacities (making the mechanism design bi-dimensional) and further, has to learn the qualities of the workers as well, to enable utility maximization. To solve this problem, we design a bi-dimensional multi-armed bandit auction that maximizes the expected utility of the auctioneer subject to incentive compatibility and individual rationality while simultaneously learning the unknown qualities of the agents.

(3) Design of a multi-parameter learning mechanism for crowdsourcing

We investigate the problem of allocating divisible jobs, arriving online, to workers in a crowdsourcing platform. Each job is split into a certain number of tasks that are then allocated to workers. These tasks have to meet several constraints that depend on the worker performance. The performance of each worker in turn is characterized by several intrinsic stochastic parameters. In particular, we study a problem where each arriving job has to be completed within a deadline and each task has to be completed, honouring a lower bound on quality. The job completion time and quality of each worker are stochastic with fixed but unknown means. We propose a learning mechanism to elicit the costs truthfully while simultaneously learning the stochastic parameters. Our proposed mechanism is dominant strategy incentive compatible and ex-post individually rational with asymptotically optimal regret performance.

Video Archive Go Top

 

Series: Ph.D. Thesis Defense
Title: Incentive Design for Crowdfunding and Crowdsourcing Markets

  • Speaker: Mr. Praphul Chandra,
                   Hewlett Packard Enterprise, Bangalore
                   (ERP Candidate)
  • Faculty Advisor: Prof. Y. Narahari (IISc) and Dr. Sitaram Ramachandrula (Hewlett Packard Enterprise)
  • Date and Time: Tuesday, March 21, 2017, 11:30 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
With the ever-increasing number of social interactions intermediated by technology (the world wide web) as the backdrop, this thesis focuses on the design of mechanisms for online communities (crowds) which strive to come together, albeit in an ad hoc fashion, to achieve a social objective. Two examples of such web-based social communities that are of central concern in this thesis are crowdsourcing markets and crowdfunding platforms. For these settings, which involve strategic human agents, we design mechanisms that incentivize contributions (effort, funds, or information) from the crowd and aggregate these contributions to achieve the specified objective. Our work is thus concerned with the challenge of designing mechanisms that encourage social coordination and cooperation among large groups of (often) anonymous users to achieve group efficiency (social welfare). We address the following related challenges:

• Can we design mechanisms to solve the free-rider problem in public goods settings? Can large anonymous groups of individuals be incentivized to contribute to create public goods?

• Can we design mechanisms that harness social networks to achieve coordination of contributions towards a public good to ensure that publicly desirable goods are successfully funded via private contributions? How do we make such mechanisms fair?

• Can we design mechanisms that improve the efficiency of markets by expanding the set of individuals who participate in the market? Can these individuals be incentivized to increase the group efficiency and, if so, at what cost?

• Can we design mechanisms that make crowdsourcing markets more equitable by offering participation opportunities to novices and higher incentives to agents with high reliability? What is the price of reliability?

Using mechanism design as the principal design tool, the thesis attempts to offer rigorous solutions to the above challenges.

Video Archive Go Top

 

Series: Ph.D. Thesis Defense
Title: Non-Parametric Clustering of Multivariate Count Data

  • Speaker: Ms. Lavanya Sita Tekumalla
  • Faculty Advisor: Prof. Chiranjib Bhattacharyya
  • Date and Time: Monday, March 20, 2017, 9:30 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
While there has been significant work in Bayesian non-parametric modeling in the last decade, there has been much less work on non-parametric clustering of multivariate count data. The first contribution of this thesis addresses extensions to clustering with Poisson-based models for multivariate counts. While the Poisson is the most popular distribution for count modeling, the Multivariate Poisson often leads to intractable inference and a sub-optimal fit of the data. To address this, we introduce a family of models based on the Sparse Multivariate Poisson that exploit the inherent sparsity in multivariate data, leading to a better fit and more efficient inference. We explore Dirichlet process mixture model extensions and temporal non-parametric extensions to models based on the Sparse Multivariate Poisson for practical use of Poisson-based models in non-parametric clustering of multivariate counts. A second contribution of this thesis addresses moving beyond the limitations of Poisson-based models for non-parametric clustering, for instance in handling overdispersed data or data with negative correlations. We explore, for the first time, marginal independent inference techniques based on the Gaussian Copula for multivariate count data in the Dirichlet process mixture model setting. This enables non-parametric clustering of multivariate counts without the limiting assumptions that usually restrict the marginals to a particular family. This inference technique also works for mixed data (a combination of count, binary and continuous data), enabling Bayesian non-parametric modeling to be used for a wide variety of data types. The third contribution of this thesis addresses modeling a wider range of more complex dependencies, such as asymmetric and tail dependencies, with Vine Copula based Dirichlet process mixtures. While vine copula inference has been well explored for continuous data, it is still a topic of active research for multivariate counts and mixed multivariate data. An efficient marginal independent inference approach based on the extended rank likelihood is proposed in this thesis, extending the use of vines to multivariate counts and mixed data in practical clustering scenarios.

This thesis also explores the novel systems application of Bulk Cache Preloading, by analyzing I/O traces through predictive models for temporal non-parametric clustering of multivariate count data. State-of-the-art techniques in the caching domain are limited to exploiting short-range correlations in memory accesses at the millisecond granularity or smaller. We explore, for the first time, Bulk Cache Preloading: the process of proactively predicting data to load into the cache minutes or hours before the actual request from the application, by leveraging longer-range correlations at the granularity of minutes or hours. Our approach involves data aggregation, converting I/O traces into a temporal sequence of multivariate counts that we analyze with the temporal non-parametric clustering models.

As an additional contribution, this thesis addresses multi-level non-parametric admixture modeling for discrete data in the form of grouped categorical data. Non-parametric topic modeling for document collections is well explored with admixture models such as the Hierarchical Dirichlet Process. However, there exist scenarios where a document needs to be associated with themes at multiple levels, where each theme is itself an admixture over themes at the previous level, motivating the need for multi-level admixtures. We propose the nested Hierarchical Dirichlet Process to address this problem and apply a two-level version of our model to non-parametric entity-topic modeling, automatically learning entities (in the form of authors) and topics from research corpora.


 

Series: Department Seminar
Title: Designing Efficient Budget-Constrained Incentive Schemes: To Compete or To Cooperate?

  • Speaker: Dr. Ragavendran Gopalakrishnan
                   Conduent Labs
  • Date and Time: Thursday, March 16, 2017, 4:00 PM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
This paper studies the design of real-world performance-based incentive schemes, which are necessarily budget-constrained, must be simple to understand and implement, and must satisfy certain fairness criteria. Participants incur a cost of effort for their performance, which they trade off against the incentives received. The resulting competition for incentives is modeled as a noncooperative game. First, we characterize budget-constrained incentive schemes that maximize performance at Nash equilibrium, and observe, interestingly, that the maximum performance is a concave function of the available budget. Second, we study competitive vs. cooperative fairness among high- and low-performing players to quantify the efficiency loss in moving from Nash to dominant-strategy equilibrium; perhaps surprisingly, under a certain mixed fairness model, this loss is zero and the equilibrium is unique. Third, we perform numerical experiments (when players are homogeneous/heterogeneous in their type, and when the employer has only partial type information) and observe interesting phenomena, e.g., an employer's inability to type-discriminate is not a significant disadvantage, and the expected maximum average performance is quite robust to one-sided type-estimation error. Finally, using real employee data from a large BPO organization, we show that our incentive scheme can result in up to a 29.8% increase in the average performance of its employees.

Host Faculty: Dr. Siddharth Barman


 

Series: M.Sc. (Engg) Colloquium
Title: Generalizations of Hitting, Covering and Packing problems on Intervals

  • Speaker: Mr. Datta Krupa
                   M.Sc. (Engg) Student, CSA, IISc
  • Faculty Advisor: Prof. Sathish Govindarajan
  • Date and Time: Tuesday, March 14, 2017, 10:30 AM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
Interval graphs are well-studied structures. Intervals can represent resources such as jobs to be scheduled; finding a maximum independent set in an interval graph then corresponds to scheduling the maximum number of non-conflicting jobs. Most optimization problems on interval graphs, such as independent set, vertex cover, dominating set and maximum clique, can be solved efficiently by combinatorial algorithms in polynomial time. Hitting, covering and packing problems have been extensively studied in the last few decades and have applications in diverse areas. While they are NP-hard in most settings, they are polynomial-time solvable for intervals. In this thesis, we consider generalizations of the hitting, covering and packing problems for intervals. For the general setting, we model the problems as min-cost flow problems using non-trivial reductions and solve them using standard flow algorithms.

The demand-hitting problem, a generalization of the hitting problem, is defined as follows: given n intervals, a positive integer demand for every interval, m points, and a real weight for every point, select a subset H of points such that every interval contains at least its demand many points from H and the sum of the weights of the points in H is minimized. When every interval has demand one, we give a dynamic-programming-based O(m+n) time algorithm, assuming the intervals and points are sorted. When all demands are equal (say, to k), we give an O(m^2*n) time flow-based algorithm. For the general case, we assume that no interval is contained in another interval; under this assumption, we give an O(m^2*n) time flow-based algorithm.
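For intuition, the unit-demand, unit-weight special case admits the classical greedy: process intervals by right endpoint and, for each interval not yet hit, take the rightmost candidate point inside it. (The weighted case in the talk uses dynamic programming instead; the function name below is illustrative.)

```python
import bisect

def hit_intervals(intervals, points):
    # Greedy minimum hitting set: every interval must contain a chosen point.
    # `points` must be sorted; returns None if no hitting set exists.
    chosen, last = [], None
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if last is not None and left <= last <= right:
            continue  # already hit by the most recently chosen point
        i = bisect.bisect_right(points, right) - 1  # rightmost point <= right
        if i < 0 or points[i] < left:
            return None  # no candidate point lies inside this interval
        last = points[i]
        chosen.append(last)
    return chosen
```

Picking the rightmost feasible point maximizes the chance of hitting later intervals, which is why sorting by right endpoint makes the greedy optimal here.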

The demand-covering problem, a generalization of the covering problem, is defined as follows: given n intervals, a real weight for every interval, m points, and a positive integer demand for every point, select a subset C of intervals such that every point is contained in at least its demand many intervals from C and the sum of the weights of the intervals in C is minimized. When every point has demand one, we give a dynamic-programming-based O(m + n log n) time algorithm. When all demands are equal (say, to k), we give an O(m*n^2) time flow-based algorithm. For the general case, we also give an O(m*n^2) time flow-based algorithm.
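Analogously, the unit-demand, unit-weight covering case has a classical greedy: for each uncovered point (in sorted order), take the containing interval that extends furthest right. (The weighted and general-demand versions use the DP and flow algorithms stated above; this sketch is only the textbook special case.)

```python
def cover_points(points, intervals):
    # Greedy minimum-cardinality cover: every point must lie in a chosen interval.
    ivs = sorted(intervals)  # by left endpoint
    chosen, i, reach, best = [], 0, float('-inf'), None
    for p in sorted(points):
        if p <= reach:
            continue  # already covered by the last chosen interval
        while i < len(ivs) and ivs[i][0] <= p:
            if best is None or ivs[i][1] > best[1]:
                best = ivs[i]  # furthest-reaching interval whose left end <= p
            i += 1
        if best is None or best[1] < p:
            return None  # infeasible: no interval contains p
        chosen.append(best)
        reach = best[1]
    return chosen
```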

The k-pack points problem, a generalization of the packing problem, is defined as follows: given n intervals, an integer k, m points, and a real weight for every point, select a subset Y of points such that every interval contains at most k points from Y and the sum of the weights of the points in Y is maximized. For the k-pack points problem, we give an O(m^2 log m) time flow-based algorithm, assuming the intervals and points are sorted.


 

Series: Ph.D. Colloquium
Title: Optimization Algorithms for Deterministic, Stochastic and Reinforcement Learning Settings

  • Speaker: Mr. Ajin George Joseph
                   Ph.D. Student, CSA, IISc
  • Faculty Advisor: Prof. Shalabh Bhatnagar
  • Date and Time: Monday, March 13, 2017, 2:30 PM
  • Venue: CSA Seminar Hall (Room No. 254, First Floor)

Abstract
Optimization is a very important field with diverse applications in the physical, social and biological sciences and in various areas of engineering. It appears widely in machine learning, information retrieval, regression, estimation, operations research and a wide variety of computing domains. The subject has been deeply studied and many algorithms are available in the literature. These algorithms, which can be executed (sequentially or concurrently) on a computing machine, explore the space of input parameters to seek high-quality solutions to the optimization problem, with the search mostly guided by certain structural properties of the objective function. In this thesis, we propose an optimization algorithm that is gradient-free, i.e., it does not employ any knowledge of the gradient or higher-order derivatives of the objective function, but rather utilizes the objective function values themselves to steer the search. The proposed algorithm is particularly effective in a black-box setting, where a closed-form expression of the objective function is unavailable and the gradient or higher-order derivatives are hard to compute or estimate. Our algorithm is inspired by the well-known cross-entropy (CE) method, a model-based search method for continuous/discrete multi-extremal optimization problems in which the objective function has minimal structure. The proposed method searches the statistical manifold of parameters that identify the probability distributions/models defined over the input space, seeking the degenerate distribution concentrated on the global optima (assumed to be finite in number).

In the early part of the thesis, we propose a novel stochastic approximation version of the CE method for the unconstrained optimization problem, where the objective function is real-valued and deterministic. The basis of the algorithm is a stochastic process of model parameters that depends probabilistically on the past history: we reuse all the previous samples obtained up to the current instant through discounted averaging, which reduces the overall computational and storage cost. Our algorithm is incremental in nature and possesses attractive features such as stability, computational and storage efficiency and better accuracy. We further investigate, both theoretically and empirically, the asymptotic behaviour of the algorithm and find that it exhibits global optimum convergence for a particular class of objective functions.
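For context, a bare-bones CE loop with an independent-Gaussian model looks as follows: sample a population, keep an elite fraction, refit the model to the elites. This is the textbook batch method, not the stochastic-approximation variant with discounted averaging proposed in the thesis; all parameter values are illustrative.

```python
import numpy as np

def ce_minimize(f, dim, iters=60, pop=100, elite_frac=0.1, seed=0):
    # Model: independent Gaussians; each round refits mean/std to the elites.
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), 2.0 * np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        x = rng.normal(mu, sigma, size=(pop, dim))
        elite = x[np.argsort([f(xi) for xi in x])[:n_elite]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-8
    return mu
```

Note that the loop uses only objective function values, never gradients, which is what makes CE-style methods attractive in black-box settings.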

Further, we extend the algorithm to the simulation/stochastic optimization problem. In stochastic optimization, the objective function possesses a stochastic character, and the underlying probability distribution is in most cases hard to comprehend and quantify. This begets a more challenging optimization problem, whose difficulty lies in the impossibility of computing the objective function values for various input parameters with absolute certainty: one can only hope to obtain noise-corrupted objective function values. Settings of this kind arise when the objective function is evaluated using a continuously evolving dynamical system or through a simulation. We propose a multi-timescale stochastic approximation algorithm, where we integrate an additional timescale to accommodate the noisy measurements and attenuate the effects of the noise asymptotically. We find that if the objective function and the measurement noise are well behaved, and the timescales are compatible, then our algorithm generates high-quality solutions.

In the later part of the thesis, we propose algorithms for reinforcement learning/Markov decision processes (MDPs) using the optimization techniques developed in the early stage. The MDP can be considered a generalized framework for modeling planning under uncertainty. We provide a novel algorithm for the prediction problem in reinforcement learning, i.e., estimating the value function of a given stationary policy of a model-free MDP (with large state and action spaces) using the linear function approximation architecture. Here, the value function is defined as the long-run average of the discounted transition costs. The resource requirement of the proposed method, in terms of computational and storage cost, scales quadratically in the size of the feature set. The algorithm is an adaptation of the multi-timescale variant of the CE method proposed in the earlier part of the thesis for simulation optimization. We also provide both theoretical and empirical evidence to corroborate the credibility and effectiveness of the approach.

In the final part of the thesis, we consider a modified version of the control problem in a model-free MDP with large state and action spaces. The control problem most commonly addressed in the literature is to find an optimal policy that maximizes the value function, i.e., the long-run average of the discounted transition payoffs. Contemporary methods also presume access to a generative model/simulator of the MDP, with the hidden premise that observations of the system behaviour in the form of sample trajectories can easily be obtained from the model. In this thesis, we consider a modified version where the objective to be optimized is a real-valued performance function (possibly non-convex) of the value function, and where the optimal policy must be sought without presuming access to the generative model. We propose a stochastic approximation algorithm for this modified control problem. The only information we presuppose available to the algorithm is a sample trajectory generated using an a priori chosen behaviour policy. The algorithm is data (sample trajectory) efficient, stable and robust, as well as computationally and storage efficient. We also provide a proof of convergence of our algorithm to a high-performing policy relative to the behaviour policy.


 
Copyright: CSA, IISc 2017      Phone: +91-80-22932368          Fax: +91-80-23602911