jacksonSquare.jpg
© Copyright 2008 Roy Tennant, FreeLargePhotos.com

  21st International Conference on
  Scientific and Statistical Database Management

   June 2-4, 2009     

   Hotel Monteleone, New Orleans, Louisiana USA
   http://ssdbm09.cs.uno.edu

Information for Authors
     Call for Papers
     Submission Instructions
     Submission Site
     Important Dates
     Accepted Papers
     Conference Program
     Proceedings
     Pictures

Registration, Accommodations & Travel
     New Orleans
     Conference Venue
     Obtain Visa
     Getting There

Program Organization
   
Program Officers
     Program Committee
     Steering Committee

Previous SSDBMs

Sponsors

uno_reversed_150.gifDepartment of
Computer Science

Accepted Papers

 

Enabling Ad Hoc Queries over Low-Level Scientific Data Sets

David Chiu and Gagan Agrawal
Presentation
Poster
Technological success has helped usher massive amounts of data for scientific analysis. To enable effective utilization of these data sets for all classes of users, supporting intuitive access and manipulation interfaces is crucial. This paper describes an autonomous scientific workflow system that enables high-level, natural language based, queries over low-level data sets. Our technique involves a combination of natural language processing, metadata indexing, and a semantically-aware workflow composition engine which dynamically constructs workflows for answering queries based on service and data availability. A specific contribution of this work is a metadata registration scheme that allows for a unified index of heterogeneous metadata formats and service annotations. Our approach thus avoids a standarized format for storing all data sets or the implementation of a federated, mediator-based, querying framework. We have evaluated our system using a case study from the geospatial domain to show functional results. A comparison between workflow planning times between using the metadata indexing scheme and linear search for data identification is also presented. Our evaluation supports the potential benefits which our aaproach can offer to scientific workflow systems and other domain-specific, data intensive applications.

Finding Regions of Interest in Large Scientific Datasets

Rishi Sinha, Marianne Winslett and Kesheng Wu
Consider a scientific range query such as find all places in Africa where yesterday the temperature was over 35 degrees and it rained. In theory, one can answer such queries by returning all geographic points that satisfy the query condition. However, in practice, users do not find this low-level answer very useful; instead they require the points to be consolidated into regions, i.e., sets of points that all satisfy the query conditions and are adjacent in the underlying mesh. In this paper, we show that when we use a good index to find the points and a good traditional connected component labeling algorithm to build the regions, the cost of consolidating the points into regions dominates range query response time. We then show how to find query result points and consolidate them into regions in expected time that is sublinear in the number of result points. This seemingly miraculous speedup comes from a point lookup phase that uses bitmap indexes and produces a compressed bitmap as the intermediate query result, followed by a region consolidation phase that operates directly on the intermediate query result bitmap and exploits the spatial properties of the underlying mesh to greatly reduce the cost of consolidating the result points into regions. Our experiments with real-world data demonstrate that in practice, our approach to region consolidation is over 10 times faster than a traditional connected component algorithm.

Classification With Unknown Classes

Chetan Gupta, Song Wang, Umeshwar Dayal and Abhay Mehta
Given a data set D, such that (Xi; yi) in D; yi in R, we are interested in first dividing the range of yi, (ymax , ymin), where ymax is the maximum of all yi corresponding to (Xi; yi) in D and ymin is the minimum of all yi corresponding to (Xi; yi) in D, into contiguous ranges which can be thought of as classes and then for a new point, Xj , predicting which range (class) it falls into. The problem is difficult, because neither the size of the ranges nor the number of ranges, are known a-priori. This was a practical problem that arose because we wanted to predict the execution time of a query in a database. For our purposes, an accurate prediction was not required, and a time range for a query was sufficient. We did not know what the time ranges should be a-priori. To solve this problem we introduce a binary tree structure called Class Discovery Tree. We have used this technique successfully for predicting the execution times of a query and this is slated for incorporation into a commercial, enterprise class Database Management System. In this paper, we discuss our solution and validate it on two more real life data sets. With one we compare with a naive approach and in the other, with published results. In both cases, our approach is superior.

Query Recommendations for Interactive Database Exploration

Gloria Chatzopoulou, Magdalini Eirinaki and Neoklis Polyzotis
Presentation
Poster
Relational database systems are becoming increasingly popular in the scientific community to support the interactive exploration of large volumes of data. In this scenario, users issue a series of SQL queries that aim to analyze the data and mine it for interesting information. First-time users, however, may not have the necessary knowledge to know where to start their exploration. Other times, users may simply overlook queries that retrieve important information. To assist users in this context, we draw inspiration from Web recommender systems and propose the use of personalized query recommendations. We discuss the main challenges in this novel application of recommendation systems, and outline a first-cut solution based on collaborative filtering. Preliminary experimental results on real user traces demonstrate that our framework can generate effective query recommendations.

Expressing OLAP Preferences

Matteo Golfarelli and Stefano Rizzi
Poster
Multidimensional databases play a relevant role in statistical and scientific applications, as well as in business intelligence systems. Their users express complex OLAP queries, often returning huge volumes of facts, sometimes providing little or no information. Thus, expressing preferences could be highly valuable in this domain. The OLAP domain is representative of an unexplored class of preference queries, characterized by three peculiarities: preferences can be expressed on both numerical and categorical domains; they can also be expressed on the aggregation level of facts; the space on which preferences are expressed includes both elemental and aggregated facts. In this paper we propose a preference algebra for OLAP, devised by taking into account the three peculiarities above. In particular, users are enabled to express their preferences also on the aggregation level of facts, for instance by stating that monthly data are preferred to yearly and daily data. The algebra includes a set of base preference constructors, specifically tailored on the multidimensional model, and composition operators. The paper is completed by the discussion of a set of tests that prove the effectiveness of our approach.

HSM: Heterogeneous Subspace Mining in High Dimensional Data

Emmanuel Muller, Ira Assent and Thomas Seidl
Heterogeneous data, i.e. data with both categorical and continuous values, is common in many databases. However, most data mining algorithms assume either continuous or categorical attributes, but not both. In high dimensional data, phenomena due to the "curse of dimensionality" pose additional challenges. Usually, due to locally varying relevance of attributes, patterns do not show across the full set of attributes. In this paper we propose HSM, which defines a new pattern model for heterogeneous high dimensional data. It allows data mining in arbitrary subsets of the attributes that are relevant for the respective patterns. Based on this model we propose an efficient algorithm, which is aware of the heterogenity of the attributes. Using a new indexing structure HSM can adapt its processing to different attribute types. In our experiments we show that HSM efficiently mines patterns in arbitrary subspaces of heterogeneous high dimensional data.

Experiment Line: Software Reuse in Scientific Workflows

Eduardo Ogasawara, Carlos Paulino, Leonardo Murta, Claudia Werner and Marta Mattoso
Presentation
Over the last years, scientists have been using scientific workflows to build computer simulations to support the development of new theories. Due to the increasing use of scientific workflows in production environments, the composition of workflows and their executions can no longer be performed in ad-hoc manner. Although current scientific workflow management systems support the execution of workflows, they present limitations on the composition of workflows. The goal of this paper is to present a systematic approach for the composition of scientific workflows. This approach introduces the concept of experiment line to support composition based on software reuse, a Software Engineering technique. The experiment line defines families of workflows at different levels of abstractions. This concept aggregates algorithms, programs, and executable codes that correspond to the same activity of the workflow at these different abstraction levels. A case study is presented to show the benefits of this approach using bioinformatics workflows. This work also presents GExpline, a tool that encompasses this approach to the composition of workflows in scientific experiments.

Comprehensive Optimization of Declarative Sensor Network Queries

Ixent Galpin, Christian Y. Brenninkmeijer, Farhana Jabeen, Alvaro A. A. Fernandes and Norman W. Paton
Poster
We present a sensor network query processing architecture that covers all the query optimization phases that are required to map a declarative query to executable code. The architecture is founded on the view that a sensor network truly is a distributed computing infrastruc- ture, albeit a very constrained one. As such, we address the problem of how to develop a comprehensive optimizer for an expressive declarative continuous query language over acquisitional streams as one of finding extensions to classical distributed query processing techniques that con- tend with the peculiarities of sensor networks as an environment for distributed computing. We show that the approach is effective and offers benefits compared to related work.

Easing the Dimensionality Curse Stretching Metric Spaces

Ives Pola, Agma Traina and Caetano Traina
Queries over sets of complex elements are performed extracting features from each element, which are used in place of the real ones during the processing. Extracting a large number of significant features increases the representative power of the feature vector and improves the query precision. However, each feature is a dimension in the representation space, consequently handling more features worsen the dimensionality curse. The problem derives from the fact that the elements tends to distribute all over the space and a large dimensionality allows them to spread over much broader spaces. Therefore, in high-dimensional spaces, elements are frequently farther from each other, so the distance differences among pairs of elements tends to homogenize. When searching for nearest neighbors, the first one is usually not close, but as long as one is found, small increases in the query radius tend to include several others. This effect increases the overlap between nodes in access methods indexing the dataset. Both spatial and metric access methods are sensitive to the problem. This paper presents a general strategy applicable to metric access methods in general, improving the performance of similarity queries in high dimensional spaces. Our technique applies a function that "stretches" the distances. Thus, close objects becomes closer and far ones becomes even farther. Experiments using the metric access method Slim-tree show that similarity queries performed in the transformed spaces demands up to 70% less distance calculations, 52% less disk access and reduces up to 57% in total time when comparing with the original spaces.

SEEDEEP: A System for Exploring and Querying Scientific Deep Web Data Sources

Fan Wang and Gagan Agrawal
A recent and emerging trend in scientific data dissemination involves online databases that are hidden behind query forms, thus forming what is referred to as the deep web. Lately, there have been several efforts focusing on integrating and providing search functionality for deep web data sources. Such system often require data access over wide-area networks, involving a large number of remote data sources and communication links. Both the servers and networking links are vulnerable to congestion and failures, and this can cause significant and unpredictable delays or even unavailability in accessing the information. In this paper, we propose a solution to address such delays or unavailability. Our solution involves dynamically adapting query processing when unexpected delays or data source unavailability occurs. We exploit the data redundancy that is found across deep web data sources, i.e., many deep web data sources in a particular domain could overlap in terms of data content. We incrementally generate a partial new query plan by bringing in new data sources that were not in the original query plan to replace the subplan that became inaccessible due to the change of network environment. Our query processing adaptation mechanism is implemented and evaluated in the context of a previously developed deep web keyword search system which currently integrates 14 deep web data sources in the biological domain. Our experiments show that our query processing adaptation strategy significantly reduces the query answering time compared to a baseline method while the overhead of query processing adaptation is very small.

B-Fabric: An Open Source Life Sciences Data Management System

Fuat Akal, Can Tuerker, Dieter Joho and Ralph Schlapbach
The advances in information technology boosted life sciences research towards systems biology which aims at studying complex interactions in biological systems in an integrative way. Steadily improving high-throughput instruments, such as genome sequencers and mass spectrometers, and analysis software produce tremendous amounts of life sciences data. In this paper, we describe a life sciences data management system that we have implemented and deployed at the Functional Genomics Center Zurich. We highlight some crucial features and share some experiences we made in developing and running this system.

Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures

Luke Gosink, Kesheng Wu, Wes Bethel, John Owens and Kenneth Joy
The multi-core trend in CPUs introduces new challenges for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the emerging thread-level parallelism inherent in multi-core architectures. Our Data Parallel Bin-based Index Strategy (DP-BIS) is the first GPU-based strategy to utilize data encoding when answering selection queries on such hardware; this encoding facilitates a better utilization of memory resources and bus bandwidth. In our approach, DP-BIS first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads may be processed simultaneously in parallel. To demonstrate the effectiveness of DP-BIS, we implement it on a GPU using the data parallel programming language CUDA. The concurrency offered by DP-BIS allows us to fully utilize the GPU's thread-level parallelism in our work---over 12,000 records can be simultaneously evaluated at any one time. In our analysis, we show that DP-BIS provides better performance than the commonly utilized CPU and GPU projection index. Additionally, due to data encoding, DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that posses limited memory resources (e.g. the GPU).

Probabilistic Similarity Search for Uncertain Time Series

Johannes Abfalg, Hans-Peter Kriegel, Peer Kroger and Matthias Renz
Poster
A probabilistic similarity query over uncertain data assigns to each uncertain database object o a probability indicating the likelihood that o meets the query predicate. In this paper, we formalize the notion of uncertain time series and introduce two novel and important types of probabilistic range queries over uncertain time series. Furthermore, we propose an original approximate representation of uncertain time series that can be used to efficiently support both new query types. In particular, we show how this representation can be used to upper and lower bound the Euclidean distance as well as dynamic time warping (DTW). Based on these distance approximations we introduce a multi-step query processor for efficient probabilistic similarity search. In our experiments we illustrate the scalability of our proposed methods to large databases of uncertain time series.

Reverse k-Nearest Neighbor Search based on Aggregate Point Access Methods

Hans-Peter Kriegel, Peer Kroger, Matthias Renz and Andreas Zuefle
Poster
We propose an original solution for the general reverse k nearest neighbor (RkNN) search problem in Euclidean spaces. Compared to the limitations of existing methods for the RkNN search, our approach works on top of Multi-Resolution Aggregate (MRA) versions of any index structures for multi-dimensional feature spaces where each non-leaf node is additionally associated with aggregate information like the sum of all leaf-entries indexed by that node. Our solution outperforms the state-of-the- art RkNN algorithms in terms of query execution times because it exploits advanced strategies for pruning index entries.

Optimization and Execution of Complex Scientific Queries over Uncorrelated Experimental Data

Ruslan Fomkin and Tore Risch
Presentation
Poster
Scientific experiments produce large volumes of data represented as complex objects that describe independent events, such as particle collisions. Scientific analyses can be expressed as queries selecting objects that satisfy complex local conditions over properties of each object. The conditions include joins, aggregate functions, and numerical computations. For efficient implementation of such queries, a new data stream management system SQISLE is developed. In SQISLE large and complex queries are specified by selecting complex objects streamed from sources through the query engine. To obtain efficient query plans SQISLE implements runtime query optimization strategies, which during query execution collect runtime query statistics, reoptimize the query using collected statistics, and dynamically switch optimization strategies. Performance is further improved by query transformation, temporary view materialization, and compile time evaluation of query fragments. It is demonstrated that queries in SQISLE perform close to or better than hard-coded C++ implementations of the same analyses.

Identifying Most Endangered Objects from Spatial Datasets

Hua Lu and Man Lung Yiu
Presentation
Real-life spatial objects usually have both geographic locations (e.g., longitude and latitude), and multiple quality attributes. Conventionally, spatial data are queried by two orthogonal aspects: spatial queries involve geographic locations only; skyline queries are used to retrieve those objects that are not dominated by others on all quality attributes. An object $p_i$ dominates $p_j$ if $p_i$ is no worse than $p_j$ on all quality attributes and better than $p_j$ on at least one quality attribute. In this paper, we study a novel query that combines both aspects meaningfully. Given two spatial datasets $P$ and $S$, and a neighborhood distance $\delta$, the \emph{most endangered object query} (MEO) returns the object $s \in S$ such that within the distance $\delta$ from $s$, the number of objects in $P$ that dominate $s$ is maximized. MEO queries appropriately capture the needs that neither spatial queries nor skyline queries alone have addressed. They have various practical applications in business planning, digital battlefield and scientific research. Nevertheless, the processing of MEO queries is challenging and it cannot be efficiently evaluated by existing solutions. Motivated by this, we propose several algorithms for processing MEO queries, which can be applied in different scenarios where different indexes are available on spatial datasets. Extensive experimental results on both synthetic and real datasets show that our proposed advanced spatial join solution achieves the best performance and it is scalable to large datasets.

Cor-Split: Defending Privacy in Data Re-Publication from Historical Correlations and Compromised Tuples

Daniele Riboni and Claudio Bettini
Presentation
Poster
Several approaches have been proposed for privacy preserving data publication. In this paper we consider the important case in which a certain view over a dynamic dataset has to be released a number of times during its history. The insufficiency of techniques used for one-shot publication in the case of subsequent releases has been previously recognized, and some new approaches have been proposed. Our research shows that relevant privacy threats, not recognized by previous proposals, can occur in practice. In particular, we show the cascading effects that a single (or a few) compromised tuples can have in data re-publication when coupled with the ability of an adversary to recognize historical correlations among released tuples. A theoretical study of the threats leads us to a defense algorithm, implemented as a significant extension of the m-invariance technique. Extensive experiments using publicly available datasets show that the proposed technique preserves the utility of published data and effectively protects from the identified privacy threats.

Tracking Files Using the Kepler Provenance Framework

Pierre Mouallem, Mladen Vouk, Scott Klasky, Norbert Podhorszki and Roselyne Barreto
Workflow Management Systems (WFMS) such as Kepler has proven to be an important tool in scientific problem solving. They can automate and manage complex processes and huge amounts of data produced by petascale simulations. Typically, the produced data need to be properly visualized and analyzed by scientists in order to achieve the desired scientific goals. Both run-time and post analysis may benefit from, even require, additional meta-data provenance information. One of the challenges in this context is the tracking of the data files produced during different activities, such as the visualizations. The Kepler provenance framework collects all or part of the raw information flowing through the workflow graph. This information then needs to be further analyzed to extract meta-data of interest. In this paper we first provide an overview of the Kepler framework as used at ORNL, and then we show how the data collected through the framework can be analyzed using add-on tools and algorithms to provide an automated way of tracking data files.

A Bipartite Graph Framework for Summarizing Binary, Categorical and Numeric High-Dimensional Data

Guanhua Chen, Xiuli Ma, Dongqing Yang, Shiwei Tang and Meng Shuai
Data summarizing is an important data mining task which aims to find a compact description of a dataset. Emerging applications place special requirements to the data summarization techniques including the ability to find concise and informative summary from high dimensional data, the ability to deal with different types of attributes such as binary, categorical and numeric attributes, end-user comprehensibility of the summary, insensibility to noise and missing values and scalability with the data size and dimensionality. In this work, we propose a general framework to summarize high-dimensional data which satisfies each of these requirements. We formulate the problem in a bipartite graph scheme, mapping objects (data records) and values of attributes into two disjoint groups of nodes of a graph, in which a set of representative objects is discovered as the summary of the original data. Further, the capability of representativeness is measured in MDL principle, which helps to yield a highly intuitive summary with the most informative objects of the input data. While the problem of finding the exact summary with minimal representation cost is computationally infeasible, we propose a heuristic algorithm to achieve an approximate summary with O(n2d) computational complexity. In addition, several techniques are developed to improve both quality of the resultant summary and efficiency of the algorithm. A detailed study using both real and synthetic datasets shows the effectiveness and efficiency of our approach in summarizing datasets with binary, categorical and numeric attributes.

Designing A Geoscientific Request Language - A Database Approach

Peter Baumann
Poster
A large part of sensor, image, and statistics data in the Earth Sciences naturally come as data points aligned to regular grids of some dimension, including 1-D time series, 2-D imagery, 3-D image time series and volumetric data, and 4-D spatio-temporal data. Frequently repositories have to host objects of multi-Terabyte size, in the future multi-Petabytes. Serving this information category, large multi-dimensional arrays, is a task of increasing importance among the Earth Sciences. Still, however, ad-hoc implementations with focused functionality prevail which lack the flexibility of SQL-enabled databases. The Web Coverage Processing Service (WCPS) geo raster model and language allows navigation, extraction, and ad-hoc analysis on multi-dimensional geoscientific raster data which can be geo-referenced or not. The request language has been designed to offer some key properties considered advantageous by the database community: it is formalized, declarative, data independent, optimizable, and safe in evaluation. WCPS has been adopted as an international standard by the Open GeoSpatial Consortium (OGC) in December 2008. The reference implementation is almost finished. Actually, the embedding into the modular framework of the OGC geo service standards has posed particular constraints which the design had to respect. We discuss conceptual model, query language, and the context using real-life use case scenarios.

Design and Implementation of Metadata System in PetaShare

Xinqi Wang and Tevfik Kosar
Presentation
Poster
As the size of scientific data set grows, it becomes imperative that expressive metadata framework be developed to facilitate access to scientific data. However, the drive to enhance expressiveness and maxi- mize performance often turns out to be contradictory. Hence, the design of metadata framework needs to take into consideration many factors in order to achieve balance between expressive power and performance. In this paper, we discuss metadata framework developed for PetaShare and explain the design and implementation we made in the process.

Split-Order Distance for Clustering and Classification Hierarchies

Qi Zhang, Eric Yi Liu, Abhishek Sarkar and Wei Wang
Clustering and classification hierarchies are organizational structures of a set of objects. Multiple hierarchies may be derived over the same set of objects, which makes distance computation between hierarchies an important task for summarization and similarity search of hierarchical patterns. In this paper, we model the classification and clustering hierarchies as rooted, leaf-labeled, unordered trees. We propose a novel distance metric Split-Order distance to evaluate the organizational structure difference between two hierarchies over the same set of leaf objects. The Split-Order distance reflects the order in which subsets of the tree leaves are differentiated from each other and can be used to explain the relationships between the leaf objects. We also propose an efficient algorithm for computing Split-Order distance between two trees in O(n^2d^4) time, where $n$ is the number of leaves, and $d$ is the maximum number of children of any node. Our experiments on both real and synthetic data demonstrate the efficiency and effectiveness of our algorithm.

Evaluating Reachability Queries Over Path Collections

Panagiotis Bouros, Spiros Skiadopoulos, Theodore Dalamagas, Dimitris Sacharidis and Timos Sellis
Presentation
Poster
Nowadays, several types of applications involve retaining and querying large volumes of sequences of items. For example the recent advances in geodata infrastructure and the proliferation of earth observation applications have resulted in large route collections of waypoints and points of interest. We refer to these kind of datasets simply as path collections. We can pose several interesting queries on path collections. In this paper, we focus on reachability queries. These queries, given two nodes s and t of a path collection, can determine whether there exists a path from s to t and identify that path. We propose a method to evaluate reachability queries. To improve the performance of our method, we also propose two indexing methods (namely P-Index and H-Index) for the organization of path collections. Finally, we conduct a set of experiments that compare our proposed methods, and we report on their evaluation.

Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation

Jessica Lin and Yuan Li
For more than one decade, time series similarity search has been given a great deal of attention by data mining researchers. As a result, many time series representations and distance measures have been proposed. However, most existing work on time series similarity search focuses on finding shape-based similarity. While some of the existing approaches work well for short time series data, they typically fail to produce satisfactory results when the sequence is long. For long sequences, it is more appropriate to consider the similarity based on the higher-level structures. In this work, we present a frequency-based representation for time series data, similar to the "bag of words" approach that is widely accepted by the text mining and information retrieval communities. We show that our approach outperforms the existing methods in clustering and classification.

Region Extraction and Verification for Spatial and Spatio-Temporal Databases

Mark McKenney
The rigorous definitions of spatial data types and the type closure properties of spatial algorithms have provided a framework to ensure spatial data correctness across complex spatial data processing tasks. Newer spatial technologies, such as spatio-temporal databases, moving object databases, data acquisition through geo-sensor networks, and large scale spatial data generation through other remote sensing methods, have created new challenges in spatial data consistency. Specifically, these new technologies require mechanisms to efficiently process spatial data and identify data items that do not conform to spatial data type definitions. Furthermore, in cases were extremely large amounts of spatial data are handled, it may be useful to automatically discard portions of spatial configurations that violate spatial data type definitions. In this paper, we propose an efficient O(n lg n ) time complexity algorithm that can examine a spatial configuration, eliminate any portions of the configuration that violate the definition of spatial regions, and construct a valid region out of the remaining configuration. The algorithm is designed with existing spatial algorithm input structures in mind so that it may be included in existing systems with minimal modification.

BioBrowsing: Making the Most of the Data Available in Entrez

Sarah Cohen-Boulakia and Kevin Masini
Presentation
Poster
One of the most popular ways to access public biological data is using portals, like Entrez (NCBI) which allows users to navigate through the data of 37 major biological sources following cross-reference links. In this process, data entries are inspected one after the other and cross-reference links to additional data available in other sources may be followed. This navigational process may be time-consuming and may not be easily reproduced from one data entry to another. Most importantly, only a few sources are initially queried, biologists are not exploiting all the richness of the information made available by Entrez to them, and in particular may not explore alternative data source paths that provide complementary information. In this paper, we introduce BioBrowsing (available online), a tool providing scientists with access to the data systematically obtained when all the combinations between sources from NCBI have been followed. Querying is done on-the-fly (no warehousing). As new sources and links between sources may appear in Entrez, BioBrowsing has an Update module, able to update automatically the schema used by the BioBrowsing's query engine. Finally, BioBrowsing makes it possible for users to define profiles as a way of focusing the results on users specific interests.

Combining Multiple Interrelated Streams for Incremental Clustering

Zaigham Faraz Siddiqui and Myra Spiliopoulou
Many data mining applications deal with structured data that span across many tables AND accumulate in time. Incremental mining methods have been devised to adapt patterns to newly arriving tuples. However, they have been designed to deal with single data tables only. We propose a method for incremental clustering of multiple interrelated streams - a "multi-table stream". Its constituting streams 1) arrive at different speeds and 2) have attributes of a priori unknown value ranges. Our method transforms the multi-table data stream into a single-table stream, encompasses solutions for caching, cache updating and data forgetting, and incorporates an incremental clustering algorithm that operates over the caches and the resulting data stream. We evaluate our method on the KDD Cup 2000 and ECML/PKDD 1999 datasets against a reference that remembers all the past and has unlimited memory. We show that our method approximates the reference well.

MLR-Index: An Index Structure for Fast and Scalable Similarity Search in High Dimensions

Rahul Malik, Sangkyum Kim, Xin Jin, Chandrasekar Ramachandran, Jiawei Han, Indranil Gupta and Klara Nahrstedt
High-dimensional indexing has been very popularly used for performing similarity search over various data types such as multimedia (audio/image/video) databases, document collections, time-series data, sensor data and scientific databases. Because of the curse of dimensionality, it is already known that well-known data structures like kd-tree, R-tree, and M-tree suffer in their performance over high-dimensional data space which is inferior to a brute-force approach linear scan. In this paper, we focus on an approximate nearest neighbor search for two different types of queries: r-Range search and k-NN search. Adapting a novel concept of a ring structure, we define a new index structure MLR-Index (Multi-Layer Ring-based Index) in a metric space and propose time and space efficient algorithms with high accuracy. Evaluations through comprehensive experiments comparing with the best-known high-dimensional indexing method LSH show that our approach is faster for a similar accuracy, and shows higher accuracy for a similar response time than LSH.

Improving Relation Extraction by Exploiting Properties of the Target Relation

Eric Normand, Kevin Grant, Elias Ioup and John Sample
The two new relation extraction algorithms presented herein differ only in their assumption about the nature of the target relation (one-to-many or many-to-many). Both algorithms demonstrate high precision and recall using a small number (n < 10) of seeds. The algorithms are tested on the same relation to show the degree of advantage gained by their differing assumptions. Comparison of the performance of the two algorithms on a one-to-many domain demonstrates the existence of several, previously undocumented behaviors which can be used to improve the performance of relation extraction algorithms. The first is a distinct, inverted u-shape in the initial portion of the recall curve of the many-to-many algorithm. The second is that, as the number of seeds increases, the rate of improvement of the two algorithms descreases to approach the rate at which new information is added via the seeds.

Data Integration with the Dalton Framework - A Case Study

Stefan Jablonski, Bernhard Volz, M. Abdul Rehman, Oliver Archner and Olivier Cure
Data integration has gained a great attention in current scientific applications due to the increasingly high volume of heterogeneous data and proliferation of diverse data generating devices like sensors. Use of sensor networks has recently been evolved in many scientific fields especially in environmental and ecological sciences. These sensors generate streams of raw data at temporal and spatial granularities. Before being used in any analytical experiment, these data are required to undergo several processing steps of integration and transformation since the data, stemming from different sensors, are highly heterogeneous in terms of format, syntax, structure and semantics. Besides these data heterogeneity issues proper transportation of data between data sources, data validation, data filtering and imposing data constraints are also main challenges for normal scientists. In order to tackle these data specific issues, workflow technology has played a vital role in scientific domains. In recent years, many scientific workflow systems (e.g. Kepler, Taverna, and Triana) have appeared in scientific domains such as geology, ecology, biology, chemistry etc. These systems, augmented with rich library of pre configured components, have contributed a lot towards scientific data integration by exploiting ontologies. Even though they offer good means for modeling computational workflows, they have been proved not to be sufficiently strong in addressing the data related issues in clear and transparent way. The DaltOn system offers to improve the productivity of scientists by providing a framework which copes with these issues in well structured manner. Unlike current scientific workflow systems, DaltOn presents a novel approach that implements the clear separation between domain-related, i.e. analysis, and data-specific operations. This separation is supported by easy means of modeling computational workflows based on some built-in operations. In this paper we will elaborate its application in real world scenario taken from ecological research where data are taken from meteorological sensor network and stored at central scientific database.

Energy Smart Management of Scientific Data

Ekow Otoo, Doron Rotem and Shih-Chiang Tsao
Scientific data centers comprised of high-powered computing equipment and large capacity disk storage systems consume considerable amount of energy. A common solution for saving energy in disk systems involves powering down disks that exhibit long idle periods and placing them in standby mode. A file request from a disk in standby mode will incur a performance penalty as it takes time (and energy) to spin up the disk before it can serve a file. For that reason, disks are not powered down immediately upon experiencing an idle period since the energy costs of powering up may exceed the energy savings. The length of the idle period until powering down occurs is called idleness threshold. In this paper, we show that clever file allocation on disks can result in a larger fraction of the disks being placed in long periods of standby mode resulting in more energy savings. We implement file allocation strategies on disks that save energy subject to performance constraints on file access costs. Based on observed workloads of scientific applications and disk characteristics, we provide a methodology for determining file assignment to disks and optimal idleness thresholds. We validate our methods analytically and with simulations using traces taken from a scientific data center.

Adaptive Physical Design for Curated Archives

Tanu Malik, Xiaodan Wang, Debabrata Dash, Amitabh Chaudhary, Randal Burns and Anastasia Ailamaki
We introduce AdaptPD, an automated physical design tool that improves database performance by continuously monitoring changes in the workload and adapting the physical design to suit the incoming workload. Current physical design tools are offline and require specification of a representative workload. AdaptPD is "always on" and incorporates online algorithms which profile the incoming workload to calculate the relative benefit of transitioning to an alternative design. Efficient query and transition cost estimation modules allow AdaptPD to quickly decide between various design configurations. We evaluate AdaptPD with the SkyServer Astronomy database using queries submitted by SkyServer's users. Experiments show that AdaptPD adapts to changes in the workload, improves query performance substantially over offline tools, and introduces minor computational overhead.

Mode Aware Query Processing

Mingrui Wei and Elke Rundensteiner
Many scientific applications include sensor network, outpatient health care monitoring, and wild life tracking require real-time stream query processing. While state-of-the-art techniques for processing window-constrained stream queries tend to employ the delta result strategy (to react to each and every change of the stream sensor measurements), some scientific applications only require to produce results periodically - making the complete result strategy a better choice. In this work, we analyze the trade-offs between the delta and the complete result query evaluation strategies. We then design a solution for hopping window queries processing based on the above analysis. In particular, we propose query operators equipped with the ability to accept both delta and complete results as input and to produce both as output. Unlike prior works, these flexible operators can then be integrated within one single hybrid query plan - taking advantage of both processing methodologies. Third, we design a mode assignment algorithm to optimally assign the input and output modes for each operator in the mode aware query plan. Lastly, mode assignment is integrated with a cost-based plan optimizer. The proposed techniques have been implemented within the WPI stream query engine, called CAPE. Our experimental results demonstrate that our solution routinely outperforms the state-of-the-art single-mode solutions over a large variety of arrival rate distributions and query plan shapes.

Scientific Mashups: Runtime-Configurable Data Product Ensembles

Bill Howe, Harrison Green-Fishback and David Maier
Mashups are gaining popularity as a rapid-development, reuse-oriented programming model to replace monolithic, bottom-up application development. This programming style is attractive for the "long tail" of scientific data management applications, characterized by exploding data volumes, increasing requirements for data sharing and collaboration, but limited software engineering budgets. We observe that scientists already routinely construct a primitive, static form of mashup---an ensemble of related visualizations that convey a specific scientific message encoded as, e.g., a Powerpoint slide. Inspired by their ubiquity, we adopt these conventional data product ensembles as a core model, endow them with interactivity, publish them online, and allow them to be repurposed at runtime by non-programmers. We observe that these scientific mashups must accommodate a wider audience than commerce and entertainment-oriented mashups. Collaborators, students (K12 through graduate), the public, and policy makers are all potential consumers, but each group has a different level of domain sophistication. We explore techniques for adapting one mashup for different audiences by attaching additional context, assigning defaults, and re-skinning component products. Existing mashup frameworks (and scientific workflow systems) emphasize an expressive "boxes-and-arrows" abstraction suitable for engineering individual products but overlooking requirements for organizing products into synchronized ensembles or repurposing them for different audiences. In this paper, we articulate these requirements for scientific mashups, describe an architecture for composing mashups as interactive, reconfigurable, web-based, visualization-oriented data product ensembles, and report on an initial implementation in active use at an Ocean Observatory.

Constraint-based Learning of Distance Functions for Object Trajectories

Wei Yu and Michael Gertz
With the drastic increase of available object trajectory data, the analysis and exploration of trajectories has become a major research focus with a variety of applications. In particular, several approaches have been proposed in the context of similarity-based trajectory retrieval. While these approaches try to be comprehensive by considering the different properties of object trajectories at different degrees, the distance functions are always pre-defined and therefore do not support different views on what users consider similar or dissimilar trajectories in a particular domain. In this paper, we introduce a novel approach to learning distance functions in support of similarity-based retrieval of multi-dimensional object trajectories. Our approach is more generic than existing approaches in that distance functions are determined based on constraints, which specify what object trajectory pairs the user considers similar or dissimilar. Thus, using a single approach, different distance functions can be determined for different users views. We present two learning techniques, transformed Euclidean distance and transformed Dynamic Time Warping. Both techniques determine a linear transformation of the attributes of multi-dimensional trajectories, based on the constraints specified by the user. We demonstrate the flexibility and efficiency of our approach with applications to clustering and classification on real and synthetic object trajectory datasets from different application domains.

Covariant Evolutionary Event Analysis for Base Interaction Prediction using Relational Database Management System for RNA

Weijia Xu, Stuart Ozer, Lei Shang and Robin Gutell
With increasing amount of sequences properly aligned, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computational expensive to perform well on large scale alignment and less effective with the sequences from diversified phylogenetic classifications. We propose a new approach considering coevolutional rates among pairs of positions using phylogenetic information of aligned sequences. With a novel data model to manage relevant information within relational database, our method, implemented within SQL server, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times bigger and 50% better sensitivity of previous study. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure.

Using Workflow Medleys to Streamline Exploratory Tasks

Emanuele Santos, David Koop, Huy Vo, Erik Anderson, Juliana Freire and Claudio Silva
Presentation
Poster
To analyze and understand the growing wealth of scientific data, complex workflows need to be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, distributed computing infrastructure, and Web services. However, constructing these workflows is a non-trivial task, especially for users who do not have programming expertise. This problem is compounded for exploratory tasks, where the workflows need to be iteratively refined. In this paper, we introduce 'workflow medleys', a new framework for manipulating collections of workflows. We propose a workflow manipulation language that includes operations that are common in exploratory tasks and present a visual interface designed for this language. We show how medleys have been applied in three (real) applications. We also present a preliminary expert study which indicates that medleys are effective and provide a framework that simplifies use and combinations of workflows.

Exploring Scientific Workflow Provenance Using Hybrid Queries Over Nested Data and Lineage Graphs

Manish Anand, Shawn Bowers, Timothy McPhillips and Bertram Ludaescher
Existing approaches for representing the provenance of scientific workflow runs largely ignore computation models that work over structured data, including XML. Unlike models based on transformation semantics, these computation models often employ update semantics, in which only a portion of an incoming XML stream is modified by each workflow step. Applying conventional provenance approaches to such models results in provenance information that is either too coarse (e.g., stating that one version of an XML document depends entirely on a prior version) or potentially incorrect (e.g., stating that each element of an XML document depends on every element in a prior version). We describe a generic provenance model that naturally represents workflow runs involving processes that work over nested data collections and that employ update semantics. Moreover, we extend current query approaches to support this provenance model, enabling queries to be posed not only over data lineage relationships, but also over versions of nested data structures produced during a workflow run. These hybrid queries can be expressed against our provenance model using high-level query constructs and issued over multiple physical provenance storage representations and database schemes.

Efficient Evaluation of Generalized Tree-Pattern Queries with Same-Path Constraints

Xiaoying Wu, Dimitri Theodoratos, Stefanos Souldatos, Theodore Dalamagas and Timos Sellis
Presentation
Querying XML data is based on the specification of structural patterns which in practice are formulated using XPath. Usually, these structural patterns are in the form of trees (Tree-Pattern Queries -- TPQs). Requirements for flexible querying of XML data including XML data from scientific applications have motivated recently the introduction of query languages that are more general and flexible than TPQs. These query languages correspond to a fragment of XPath, larger than TPQs, for which efficient non-main-memory evaluation algorithms are not known. In this paper, we consider a query language, called Partial Tree-Pattern Query (PTPQ) language, which generalizes and strictly contains TPQs. PTPQs represent a broad fragment of XPath which is very useful in practice. We show how PTPQs can be represented as directed acyclic graphs augmented with "same-path" constraints. We develop an original polynomial time holistic algorithm for PTPQs under the inverted list evaluation model. To the best of our knowledge, it is the first algorithm to support the evaluation of such a broad fragment of XPath. We provide a theoretic analysis of our algorithm and identify cases where it is asymptotically optimal. In order to assess its performance, we design two other techniques that evaluate PTPQs by exploiting the state-of-the-art existing algorithms for smaller classes of queries. An extensive experimental evaluation shows that our holistic algorithm outperforms the other ones.

View Discovery in OLAP Databases Using Combinatorial Information Theory

Cliff Joslyn, John Burke, Terence Critchlow, Nick Hengartner and Emilie Hogan
Presentation
The capability of OLAP database software systems to handle data complexity comes at a high price for analysts, presenting them a combinatorially vast space of views of a relational database. We respond to the need to deploy technologies sufficient to allow users to guide themselves to areas of local structure by casting the space of "views" of an OLAP database as a combinatorial object of all projections and subsets, and "view discovery" as an search process over that lattice. We equip the view lattice with statistical information theoretical measures sufficient to support a combinatorial optimization process. We outline "hop-chaining" as a particular view discovery algorithm over this object, wherein users are guided across a permutation of the dimensions by searching for successive two-dimensional views, pushing seen dimensions into an increasingly large background filter in a "spiraling" search process. We illustrate this work in the context of data cubes recording summary statistics for radiation portal monitors at US ports.

Experiences on Processing Spatial Data with MapReduce

Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Rishe Naphtali
Presentation
The amount of information in spatial databases is growing as more data is made available. Spatial databases mainly store two types of data: raster data (satellite/aerial digital images), and vector data (points, lines, polygons). The complexity and nature of spatial databases makes them ideal for applying parallel processing. MapReduce is an emerging massively parallel computing model, proposed by Google. In this work, we present our experiences in applying the MapReduce model to solve two important spatial problems: (a) bulk-construction of R-Trees and (b) aerial image quality computation, which involve vector and raster data, respectively. Our algorithms were executed on a Google and IBM cluster, which became available to us through an NSF-supported program. The cluster supports the Hadoop framework - an open source implementation of MapReduce. Our results confirm the excellent scalability of the MapReduce framework in processing parallelizable problems.

The Scientific Data Management Center: Providing Technologies for Large Scale Scientific Exploration

Arie Shoshani


What Makes Scientific Workflows Scientific?

Bertram Ludaescher
Presentation


Cloud Computing for Science

Kate Keahey
Presentation