Recent PhD Research: Alexander Kalinin Creates Systems Tools For Large Multidimensional Data Sets
- Posted by Jesse Polhemus
- on Sept. 7, 2016
Brown CS has just begun a project to highlight the wide-ranging and innovative research of our PhD students as they approach graduation. Our latest contributor, Alexander Kalinin, discusses his work below. He'll be defending his thesis on Monday, September 12, at 12 PM in CIT 506. To see other PhD research overviews, click here.
Integrated Search and Exploration Over Large Multidimensional Data
In the era of Big Data, efficient, interactive ad-hoc data analysis has become a necessity for professionals from the variety of scientific areas. Take an astronomer, for example. She might be looking at various astronomical data sets, like the Sloan Digital Sky Survey (SDSS) or the Large Synoptic Survey Telescope (LSST). Her tasks may vary from relatively simple retrieval of the information about different celestial objects to complex search queries. An example of a complex search query might include searching for an interesting rectangular region in the sky, satisfying constraints on the shape and content. The user might describe the perimeter and area of the region or give the ranges of lengths in various dimensions. She might also add constraints for various astronomical measures of objects within such a region, e.g. in terms of photometric magnitudes available in SDSS.
Region-based search for an astronomical data-set. Searching for bright regions of different sizes.
Another example includes a researcher working with a medical database, like the MIMIC data set. MIMIC contains signal waveforms for ICU patients over a large period of time. The researcher might look for different time intervals (i.e. one-dimensional regions) that satisfy certain conditions. For example, she might be looking for the time interval similar to some predefined pattern or containing a signal spike that makes it an anomaly. Such region-based data exploration has a remarkable expressibility for a plethora of interesting tasks.
An example of region-based MIMIC search. Searching for time intervals similar to the specified one. The similarity is based on the Euclidean distance.
What makes such problems hard? The reasons are twofold. First of all, there is a very large search space of possible objects to consider. Regions can be anywhere in the data, can overlap and even be of different shapes. The user just gives some constraints, which express her interests, but does not give the location of the region itself. This immediately requires the application of sophisticated search techniques. Consider the naive method of looking at every possible region in the sky, for instance, and checking the constraints for each of them! This is simply infeasible! Secondly, the constraints themselves require potentially expensive computation with costly I/O accesses. For example, computing the average amplitude of a signal within a time-interval might require accessing a large number of measurements and performing an aggregation over them.
Surprisingly, to the best of our knowledge, there are no existing systems that effectively and efficiently address both of these issues. Traditional DBMSs, while quite proficient at managing vast amounts of data, struggle with search problems considerably. In a nutshell, a traditional DBMS can very efficiently retrieve all data belonging to a particular region, but it cannot efficiently find a region satisfying the user-defined constraints. The former is a data retrieval problem, while the latter is a search one. At the same time, there has been extensive research of various algorithms and methods dealing with large search spaces. In particular, our research leverages Constraint Programming (CP) solvers. Such solvers employ a variety of methods to explore large search spaces efficiently and identify results quickly. At the same time, due to their inherent generality, they are remarkably expressive. However, CP methods do not work well with large data sets. They assume the constraints are relatively cheap to verify, which, as we pointed out above, is not necessarily true for modern exploration problems.
Our research deals with the problem of bringing generic and reusable systems tools for interactive search, exploration and mining over large multidimensional data sets to users from different domains of knowledge. We specifically investigate data-driven search techniques via the marriage between the traditional DBMS and CP technologies. This union offers the rich expressiveness and efficiency of constraint-based search and optimization provided by modern CP solvers, and the ability of DBMS to store and query data at scale, resulting in an enriched functionality that can effectively support data- and search-intensive applications.
The first part of our research is the realization of our general vision of combining CP and traditional DBMSs for efficient data-intensive exploration. We called the new system Searchlight. This system allows the CP machinery to run seamlessly inside a DBMS without the need to transform, extract and move the data. The main idea behind Searchlight is to use speculative query execution over the in-memory synopsis instead of the original data. Since CP solvers thrive at such tasks, this allows Searchlight to quickly identify possible results. At the same time, using synopsis, which can be seen as lossy compression of the original data, means the results might contain false positives. Searchlight efficiently filters out such false positives by validating query constraints over the original data concurrently with the main search process. The Speculate-Validate approach brings with it a number of interesting challenges, including search- and data-balancing, and mediating of CPU resources between the CP solver and the DBMS, which has been a significant part of our research. Our approach has shown very promising results for both interactive and query completion performance. For many queries Searchlight outperforms standalone CP or traditional DBMS solutions by orders of magnitude in both the time to output the first results and query completion times.
At the same time, fast, interactive query evaluation is only one of the requirements of effective data-exploration support. The users might be unfamiliar with a new data set and have fuzzy exploration goals, at least initially. Thus, it might be cumbersome for them to specify exploration queries correctly. Incorrect query specification leads to unsatisfactory results, and pushes the users to initiate a query guessing game with the system: trying to find queries allowing representative, easier to analyze results. The second part of our research deals with two manifestations of this problem: the problem of too few and too many results. Searchlight can automatically relax or constrain the query during the execution bringing more or fewer results, depending on the user’s preference. By relaxing the query, Searchlight can discover approximate results that are still close to the original constraints. By constraining the query, Searchlight can rank results and efficiently filter out the inferior ones. To achieve this Searchlight dynamically changes the query constraints during the execution and adds new ranking constraints to the query. Since constraints and the CP solver are treated as first-class citizens by Searchlight, this results in a very natural solution to the problem. Our experimental results have shown that Searchlight provides tremendous time savings for most exploration queries.
This research has a number of interesting future directions as well. When we combined the two technologies, we have only scratched the surface of potential optimizations. Among the promising directions are a deeper integration of the CP solver and DBMS, when the two can exchange information about the constraints, search space and the data; adaptive query optimization, with dynamic choice and change of search heuristics; and investigating more sophisticated queries consisting of multiple search sub-problems joined together by additional constraints.
Combining two existing, well-researched technologies is no less an interesting task than creating a new system from scratch. At the same time, it does not result in a decreased number of interesting challenges. In our case these challenges stem from the inherent differences in both architecture and purposes of the two systems: CP solvers and traditional DBMSs. By combining these two technologies into a single system we have been able to demonstrate the benefits of such a union for both interactive and total performance for data-intensive exploration tasks. We believe Searchlight will be a useful tool for interactive data exploration and look forward at exploring more use cases and solving the new systems challenges!