Indexing for Querying Metric Spaces

Speaker: Dr. Yunjun Gao, Zhejiang University, China

Abstract: Spatial queries, including similarity search, similarity joins, and aggregate k nearest neighbors queries, are useful in many areas, such as multimedia retrieval, data integration, and computational biology, to name but a few. However, they are not yet supported well by commercial DBMS. This may be due to the complex data types involved and the needs for flexible similarity criteria seen in real applications. In this talk, we propose an efficient disk-based metric access method, the Space-filling curve and Pivot-based B+-tree (SPB-tree), to accelerate query processing and support a wide range of data types and similarity metrics. The SPB-tree uses a small set of so-called pivots to reduce significantly the number of distance computations, utilizes a space-filling curve to cluster the data into compact regions, thus improving storage efficiency, and employs a B+-tree with minimum bounding box information as the underlying index. The SPB-tree also utilizes a separate random access file to efficiently manage a large and complex data. By design, it is easy to integrate the SPB-tree into an existing DBMS. In this talk, we also present efficient similarity search, similarity join, and aggregate k nearest neighbor query algorithms and corresponding cost models based on the SPB-tree. In addition, we develop a distributed geo-textual image retrieval and recommendation system (I2RS), which employs SPB-trees to index geo-textual images, and utilizes metric similarity queries, including time-aware range and k nearest neighbor queries, to provide a variety of geo-textual image retrieval and recommendation services.

Influence computation in spatial databases

Speaker: Dr. Muhammad Aamir Cheema (Monash University, Australia)

Abstract: Consider a set of facilities (e.g., restaurants, fuel stations) and a set of users (e.g., diners, drivers). A user may issue a spatial query to find the facilities matching her preferences. In this tutorial, we focus on three most important queries issued by the users: 1) k nearest neighbors queries; 2) top-k queries; and 3) skyline queries. We say that a user u is interested in a facility q if q is one of the answers for the spatial query issued by u. Then, the influence set of a query facility is defined to be the set of all the users that are interested in q. Influence computation has attracted significant attention in the past few years due to its importance in many fields such as marketing, decision support systems and location based services. For example, a restaurant owner may want to find the users that are interested in her restaurant to send special deals. In this tutorial, we first highlight the importance of influence computation in spatial databases using various examples. Then, we introduce three most notable influence related queries: reverse k nearest neighbor (RkNN) queries, reverse top-k queries and reverse skyline queries. Finally, we present key ideas and insights in the most significant techniques to answer these queries.

Biography: Muhammad Aamir Cheema is a Lecturer at Clayton School of Information Technology, Monash University, Australia. He obtained his PhD from UNSW Australia in 2011. He has significant research experience in the areas of location-based services, spatio-temporal databases, mobile and pervasive computing and probabilistic databases and publishes frequently in top-tier conferences and journals such as PVLDB, SIGMOD, ICDE, VLDBJ, TODS, TKDE etc. He has won several prestigious awards and honours including 2012 Malcolm Chaikin Prize for Research Excellence in Engineering, 2014 Deans Award for Excellence in Research by an ECR, two CiSRA best-paper-of-the-year awards (2009 and 2010), two one-of-the-best-papers awards (ICDE 2010 and ICDE 2012), two best paper awards (WISE 2013 and ADC 2010), and a poster-of-the-day award (ICDE 2011). He was the publication chair for DASFAA 2015, PC co-chair for ADC 2015, local co-chair for APWeb 2013 and workshop co-chair for MSTD 2013.

Providing Retrievability Guarantees in Cloud-Based Storage Services

Speaker: Dr. Yin Yang, College of Science and Engineering, Hamad Bin Khalifa University (HBKU)

Abstract: Nowadays users store increasingly large amounts of data in cloud-based storage services such as Google Drive, Apple iCloud Drive and Microsoft OneDrive. For important data, the user often wants to verify that her data stored over the cloud have not been corrupted or compromised, which we call data retrivability. Simple approaches to this problem tend to be inefficient (e.g., the user could download the entire dataset and verify it locally, which incurs high communication overhead) or ineffective (e.g., the user could download and verify random blocks of the data, which tends to miss small-scale data corruptions). Meanwhile, traditional data authentication methods often burden the cloud service with heavy computations, leading to increased costs for the service. This tutorial introduces to the audience several state-of-the-art solutions for verifying data retrievability. We will overview their main ideas, and compare their efficiency, effectiveness and security guarantees. Additionally, we will discuss several interesting issues and application scenarios that have not been sufficiently addressed by existing methods.

On Repairing Structural Problems In Semi-structured Data

Speaker: Dr. Shanshan Ying, Advanced Digital Sciences Center, Singapore

Abstract: Semi-structured data such as XML are popular for data interchange and storage. However, many XML documents have improper nesting where open- and close-tags are unmatched. Since some semi-structured data (e.g., Latex) have a flexible grammar and since many XML documents lack an accompanying DTD or XSD, we focus on computing a syntactic repair via the edit distance. To solve this problem, we propose a dynamic programming algorithm which takes cubic time. While this algorithm is not scalable, well-formed substrings of the data can be pruned to enable faster computation. Unfortunately, there are still cases where the dynamic program could be very expensive; hence, we give branch-and-bound algorithms based on various combinations of two heuristics, called MinCost and MaxBenefit, that trade off between accuracy and efficiency. Finally, we experimentally demonstrate the performance of these algorithms on real data.