SSTD 2025

Session Detail

Session 1 - Spatial and Spatio-temporal Data and Language Models

SNC: A Framework for Verification and Generation of Spatial NLQ Corpora

Weijia Yi (Nanjing University of Aeronautics and Astronautics); Xieyang Wang (Nanjing University of Aeronautics and Astronautics); Mengyi Liu (Nanjing University of Aeronautics and Astronautics); Jianqiu Xu (Nanjing University of Aeronautics and Astronautics)*

Spatial databases are garnering considerable attention due to their extensive applications in location-based services, urban planning, and intelligent transportation. However, natural language interfaces for spatial databases (NLIDBs) continue to encounter significant challenges, particularly in processing complex and diverse user queries. A high-quality natural language query (NLQ) corpus is essential to train robust NLIDB models. Existing corpora often suffer from a lack of diversity and syntactic accuracy, which restricts the generalization capabilities of NLIDBs. To address this issue, we propose a query detection and repair method tailored to spatial databases and construct a high-quality spatial NLQ corpus. This approach consists of (i) a verification module for spatial NLQ corpora (SNC-V), which leverages spatial entity and relation knowledge bases to identify and correct errors in NLQs, and (ii) a generation module for spatial NLQ corpora (SNC-G), which generates structured and diverse queries. Experimental results show that the SNC-V improves the conversion rate and accuracy of spatial NLIDB models, while the SNC-G achieves a query generation speed of 1,424 queries per second, a diversity score of 0.87, and an accuracy of 95.2%.

Comprehending Spatio-temporal Data via Cinematic Storytelling using Large Language Models

Panos Kalnis (King Abdullah University of Science and Technology)*; Shuo Shang (University of Electronic Science and Technology of China); Christian Jensen (Aalborg University)

Spatio-temporal data captures complex dynamics across both space and time, yet traditional visualizations are complex, require domain expertise and often fail to resonate with broader audiences. Here, we propose a storytelling-based framework for interpreting spatio-temporal datasets, transforming them into compelling, narrative-driven experiences. We utilize large language models and employ retrieval augmented generation (RAG) techniques to generate comprehensive stories. Drawing on principles common in cinematic storytelling, we emphasize clarity, emotional connection, and audience-centric design. As a case study, we analyze a dataset of taxi trajectories. Two perspectives are presented: a captivating story based on a heat map that visualizes millions of taxi trip endpoints to uncover urban mobility patterns; and a detailed narrative following a single long taxi journey, enriched with city landmarks and temporal shifts. By portraying locations as characters and movement as plot, we argue that data storytelling drives insight, engagement, and action from spatio-temporal information. The case study illustrates how storytelling can bridge the gap between data complexity and human understanding. The aim of this short paper is to provide a glimpse to the potential of the cinematic storytelling technique as an effective communication tool for spatio-temporal data, as well as to describe open problems and opportunities for future research.

Session 2 - Data Mining and Forecasting

Multivariate Tests for Point and Probabilistic Forecasts, with an Application to Spatio-temporal Weather Forecast Evaluation

Annette Jing (Stanford University)*; Phillip Jang (Amazon)

Spatio-temporal forecasting has become increasingly important across a wide range of applications, from weather prediction to transportation and energy systems. These forecasting tasks naturally involve multiple interdependent variables observed over space and time, making them inherently multivariate in nature. However, evaluation techniques for multivariate forecasts often rely on simple averages of univariate metrics, potentially overlooking joint forecast quality across multiple variables. Such naive comparisons also cannot reliably distinguish whether the observed improvements are genuine or simply due to random variation. Furthermore, despite growing interest in probabilistic forecasting, there remains a lack of formal evaluation methods tailored to assessing the joint distributional accuracy of multiple time series forecasts. To address this gap, we propose multivariate extensions to two popular forecast evaluation hypothesis tests. First, we introduce a one-sided multivariate Diebold-Mariano test (Diebold and Mariano, 1995; Harvey et al., 1997; Mariano and Preve; 2012) to assess whether the point forecast of one model outperforms another for one or more target variables. Second, we develop a test for correct specification of joint distributions by expanding upon the probability integral transform-based approach originally proposed for univariate densities (Rossi and Sekhposyan, 2019). Both tests are validated through theoretical analysis as well as simulation studies. Finally, we demonstrate the practical value of these multivariate tests through an application to spatio-temporal weather forecasts. The proposed tests detect the improvements of a functional time series model (Jang and Matteson, 2023) over a naive model in describing the joint distribution of weather forecast errors.

Distributed Community Detection in Temporal Graphs

Konstantinos Christopoulos (University of Patras)*; Konstantinos Tsichlas (University of Patras)

Community detection is a pivotal task in network analysis and has been extensively studied over the past 25 years, primarily in the context of static networks. It involves partitioning the network into groups of nodes that are more densely connected to each other than to the rest of the network. In recent years, an additional layer of complexity has emerged with the study of historical graphs, where each node and edge is associated with a valid time interval. Unlike traditional static graphs, historical graphs incorporate temporal information, enabling the analysis of how connections between entities evolve over time. In this study, we propose a distributed algorithm for community detection in historical graphs, constrained to a user-specified query time interval. Given such a query interval, our method identifies communities by evaluating the contribution of each edge and node active during that period. To the best of our knowledge, this setting has not been considered before in the literature.

Discovering Super-Colocation Patterns: A Summary of Results

Shuai An (University of Minnesota)*; Shesha Sai Kumar Reddy Sadu (University of Minnesota); Arun Sharma (University of Minnesota); Majid Farhadloo (University of Minnesota); Shashi Shekhar (University of Minnesota)

Given a collection of Boolean spatial features, Super-Colocation Pattern Discovery identifies subsets of features that are not only frequently located together but also have dense interactions. For example, the presence of multiple immune cells around cancer cells is more interesting to oncologists than a simple colocation between immune and cancer cells. This problem is important due to its societal applications, including oncology, transportation, and economic analysis. The problem is challenging due to the need to model interaction density among a subset of Boolean spatial features. Related work on colocation pattern mining is limited due to a lack of conceptual, logical, and physical models that accurately represent interaction density. Traditional interest measures (e.g., participation index) largely focus on the mere presence of another spatial feature type and overlook the number or density of neighboring instances. We propose a novel interest measure, termed Super-Colocation Density, which utilizes a matrix or tensor along with a utility-based index to quantify the interaction density among subsets of spatial features. We also introduce novel Super-Colocation Mining algorithms and evaluate the proposed methods through both theoretical analysis and experiments with real and synthetic data.

Session 3 - Maritime Data

A Multi-Modal Knowledge-Enhanced Framework for Vessel Trajectory Prediction

Haomin Yu (Aalborg University); Tianyi Li (Aalborg University)*; Kristian Torp (Aalborg University); Christian S. Jensen (Aalborg University)

Accurate vessel trajectory prediction facilitates improved navigational safety, routing, and environmental protection. However, existing prediction methods are challenged by the irregular sampling time intervals of the vessel tracking data from the global AIS system and the complexity of vessel movement. These aspects render model learning and generalization difficult. To address these challenges and improve vessel trajectory prediction, we propose Multi-modAl Knowledge-Enhanced fRamework (MAKER) for vessel trajectory prediction. To contend better with the irregular sampling time intervals, MAKER features a Large language model-guided Knowledge Transfer (LKT) module that leverages pre-trained language models to transfer trajectory-specific contextual knowledge effectively. To enhance the ability to learn complex trajectory patterns, MAKER incorporates a Knowledge-based Self-paced Learning (KSL) module. This module employs kinematic knowledge to progressively integrate complex patterns during training, allowing for adaptive learning and enhanced generalization. Experimental results on two vessel trajectory datasets show that MAKER can improve the prediction accuracy of state-of-the-art methods by 12.08%—17.86%.

Physics-Informed Neural Networks for Vessel Trajectory Prediction: Learning Time-Discretized Kinematic Dynamics via Finite Differences

Md Mahbub Alam (Dalhousie University); José F. Rodrigues-Jr (University of Sao Paulo); Amilcar Soares-Jr (Linnaeus University); Gabriel Spadon (Dalhousie University)*

Accurate vessel trajectory prediction is crucial for navigational safety, route optimization, traffic management, search and rescue operations, and autonomous navigation. Traditional data-driven models lack real-world physical constraints, leading to forecasts that violate vessel motion dynamics, such as in scenarios with limited or noisy data where sudden course changes or speed variations occur due to external factors. To address this limitation, we propose a Physics-Informed Neural Network (PINN) approach for trajectory prediction that integrates a streamlined kinematic model for vessel motion into the neural network training process via first- and second-order, finite-difference physics-based loss functions. These loss functions, discretized using the first-order forward Euler method, Heun's second-order approximation, and refined with a midpoint approximation based on Taylor series expansion, enforce fidelity to fundamental physical principles by penalizing deviations from expected kinematic behavior. We evaluated PINN using real-world AIS datasets that cover diverse maritime conditions and compared it with state-of-the-art models. Our results demonstrate that the proposed method reduces average displacement errors by up to 32% across models and datasets while maintaining physical consistency. These results enhance model reliability and adherence to mission-critical maritime activities, where precision translates into better situational awareness in the oceans.

ImPORTance - Machine Learning-Driven Analysis of Global Port Significance and Network Dynamics for Improved Operational Efficiency

Emanuele Carlini (ISTI/CNR); Domenico Di Gangi (ISTI/CNR); Vinicius Monteiro de Lira (UFC); Hanna Kavalionak (ISTI/CNR); Amilcar Soares (Linnaeus University); Gabriel Spadon (Dalhousie University)*

Seaports play a crucial role in the global economy, and researchers have sought to understand their significance through various studies. In this paper, we aim to explore the common characteristics shared by important ports by analyzing the network of connections formed by vessel movement among them. To accomplish this task, we adopt a bottom-up network construction approach that combines three years' worth of AIS (Automatic Identification System) data from around the world, constructing a Ports Network that represents the connections between different ports. Through this representation, we utilize machine learning to assess the relative significance of various port features. Our model examined such features and revealed that geographical characteristics and the depth of the port are indicators of a port's significance to the Ports Network. Accordingly, this study employs a data-driven approach and utilizes machine learning to provide a comprehensive understanding of the factors contributing to the importance of ports. The outcomes of our work aim to inform decision-making processes related to port development, resource allocation, and infrastructure planning within the industry.

Session 4 - Road Networks

BRkNN-light: Batch Processing of Reverse k-Nearest Neighbor Queries for Moving Objects on Road Networks

Anbang Song (Yantai University)*; Ziqiang Yu (Yantai University); Wei Liu (Yantai University); Yating Xu (Yantai University); Mingjin Tao (Yantai University)

The Reverse k-Nearest Neighbor (RkNN) query over moving objects on road networks seeks to find all moving objects that consider the specified query point as one of their $k$ nearest neighbors. In location based services, many users probably submit RkNN queries simultaneously. However, existing methods largely overlook how to efficiently process multiple such queries together, missing opportunities to share redundant computations and thus reduce overall processing costs. To address this, this work is the first to explore batch processing of multiple RkNN queries, aiming to minimize total computation by sharing duplicate calculations across queries. To tackle this issue, we propose the BRkNN-Light algorithm, which uses rapid verification and pruning strategies based on geometric constraints, along with an optimized range search technique, to speed up the process of identifying the RkNNs for each query. Furthermore, it proposes a dynamic distance caching mechanism to enable computation reuse when handling multiple queries, thereby significantly reducing unnecessary computations. Experiments on multiple real-world road networks demonstrate the superiority of the BRkNN-Light algorithm on the processing of batch queries.

The Batch Insertion Operator for Shared Mobility Route Planning on Time-Dependent Road Networks

Aaditya Mukherjee (University of Victoria); Sean Chester (University of Victoria)*; Mario Nascimento (Northeastern University)

Effective route planning for shared mobility is crucial for user experience in transportation services such as ridesharing, logistics, and food delivery. However, in high-volume applications with travel times that depend on the time of day, high computational complexity can impair route planning throughput. This paper proposes a batch insertion operator that handles multiple concurrent requests. It proposes a novel partitioning solution that uses route length rather than spatial proximity as a partitioning criterion, then adaptively assigns workers to larger partitions. Compared to a greedy solution that repeatedly applies the standard, i.e., non-batched, insertion operator, query processing is significantly accelerated with minor to no degradation in solution quality. Extensive experiments using three city-scale datasets demonstrate that this approach can provide speedups up to 15× and is also amenable to parallelisation.

Session 5 - Mobility Analysis

Scalable Processing of Moving Flock Patterns

Andres Oswaldo Calderon Romero (Pontificia Universidad Javeriana)*; Vassilis J. Tsotras (University of California); Petko Balakov (Esri); Marcos Vieira (Google)

This work presents a scalable approach for identifying moving flock patterns in large trajectory databases, addressing the inefficiencies in current techniques for handling large spatio-temporal datasets. A moving flock pattern refers to a group of entities that move closely together within a defined spatial radius for a minimum time interval. We focus on improving the state-of-the-art sequential algorithms, which is capable of detecting such patterns but suffers from high computational costs, particularly with large datasets. By leveraging distributed frameworks and utilizing spatial partitioning, the proposed solution aims to significantly reduce the time required for detecting flock patterns. We highlight the challenges of spatial and temporal joins in massive datasets and offer optimizations like partition-based parallelism and strategies for managing flock patterns that span multiple partitions. The paper presents an experimental evaluation using synthetic trajectory datasets, demonstrating that the proposed methods substantially improve scalability and performance compared to existing sequential algorithms.

Generalized Origin-Destination-Time Flow Patterns

Chrysanthi Kosyfaki (The University of Hong Kong )*; Nikos Mamoulis (University of Ioannina); Reynold Cheng (The University of Hong Kong); Ben Kao (The University of Hong Kong)

Analyzing flow of objects or data at different granularities of space and time can unveil interesting insights or trends. For example, transportation companies, by aggregating passenger travel data (e.g., counting passengers travelling from one region to another), can analyze movement behavior. In this paper, we study the problem of finding important trends in passenger movements between regions at different granularities. We define Origin (𝑂), Destination (𝐷), and Time (𝑇) patterns (ODT patterns) and propose an algorithm that enumerates them. We propose optimizations that greatly reduce the search space and the computational cost of pattern enumeration. We also propose variants (constrained patterns and top-𝑘 patterns) that could be useful to different application scenarios. We evaluate our methods on three real datasets and identify significant ODT flow patterns in them.

Private Next Location Prediction using Transformers: Enhancing Accuracy under Differential Privacy Constraints

Andrei Ouatu (Politehnica University Bucharest); Gabriel Ghinita (Hamad Bin Khalifa University)*; Razvan Rughinis (Politehnica University Bucharest)

Next location prediction is an essential task in location-based systems where personalization for users is a key factor. Lately, to predict the next location, machine learning and deep learning techniques (as transformer decoders) based on historic users’ trajectories are employed. However, training a machine learning or deep learning model based on users’ personal data (as spatial trajectories) can make the model to memorize the mobility pattern of users raising the risks of a successful attack able to extract these personal information and movement behaviors. In this paper, we investigate how private training of models dealing with location data under Differential Privacy guarantees can be achieved, and we identify the causes of its sub-optimal performance. We propose a series of methods that enhance the overall accuracy of the trained model, while maintaining differential privacy constraints. Our experimental evaluation shows that the proposed methods achieve a higher accuracy than the benchmark, under the same privacy constraints. When combined together our methods are able to outperform the non-private training, while benefiting from the protection given by the Differential Privacy guarantees, marking a significant advancement towards private training of deep learning models used in real-world location-based systems.

Target and Non-target Category Classification from GPS and Check-in Data

Ryo Shirai (The University of Osaka); Ryo Imai (LY Coorporation); Daichi Amagata (Osaka University)*

GPS data analysis is one of the main operators in geographical information systems. However, because of security and privacy issues, we often face situations where GPS data cannot be obtained frequently. Such situations and the measurement errors of GPS coordinates make identifying user behaviors challenging. In this work, we assume this setting and tackle the classification problem of target and non-target categories for the first time. Target categories are store categories in the scope of a service provider, whereas non- target ones are those that are not in. Given a GPS point, this problem estimates which category this location belongs to, so it is a binary classification problem. This problem has two main difficulties. First, we cannot obtain labeled data of the non-target categories. Second, many GPS data have error ranges and no labels, i.e., they do not clarify where the users stay. To solve the problem while addressing these difficulties, we propose a new classification method based on machine learning. We exploit GPS and check-in data to obtain user feature vectors at a given time. Our loss function considers non- stay information on each store category to identify the non-target space in the feature space. From these techniques, we compute the probability of staying in one of the (non-)target categories. We conduct experiments on real-world datasets, and the results show the effectiveness of our method

Session 6 - Time Series and Interval Data

ReefsDB: An Efficient Design and Implementation of Time-Series Store on SSDs

Shuai Liu (The Institute of Software, Chinese Academy of Sciences (ISCAS)); Ying Qiao (The Institute of Software, Chinese Academy of Sciences (ISCAS))*; Chang Leng (The Institute of Software, Chinese Academy of Sciences (ISCAS)); Hongan Wang (The Institute of Software, Chinese Academy of Sciences (ISCAS))

The rapid growth of the Internet of Things (IoT) has resulted in an explosive increase in time-series data, making time-series databases (TSDBs) such as InfluxDB and OpenTSDB essential components in IoT ecosystems. At the same time, the decreasing cost of SSDs has facilitated their increasing adoption in large-scale data centers. Traditional TSDBs are primarily based on Log-Structured Merge Tree (LSM-tree) optimized for HDDs, which convert random reads and writes into sequential ones. However, these systems fail to fully exploit the unique characteristics of SSDs, such as random I/O operations and internal parallelism. In this paper, we present ReefsDB, an LSM-tree-based TSDB that is highly optimized for SSDs and implemented using the Rust programming language. We evaluate ReefsDB using the Time Series Benchmark Suite for write-intensive workloads, and the results demonstrate that ReefsDB is 2.9×∼3.2× faster than InfluxDB in write performance, while reducing read latencies by 27%∼62%.

TIDE: Indexing Time Intervals by Duration and Endpoint

Kai Wang (HKUST); Moin Moti (HKUST); Dimitris Papadias (HKUST)*

Indexes for large collections of intervals are common in temporal databases, where each record has a lifespan, or validity interval. We propose a universal representation that encapsulates various interval indexes using diagonal corner structures, providing valuable insights about their effectiveness. Moreover, we exploit our findings to develop TIDE, a disk-based index for historical intervals. TIDE adopts a two-level architecture. A top tree organizes intervals by their duration. The leaf nodes of the top tree correspond to the root nodes of bottom trees, ordering intervals by their endpoints. Both top and bottom trees are append-only B+-Trees to facilitate fast insertions. An experimental evaluation with real data sets shows that TIDE achieves impressive performance gains with respect to its direct competitor, on insertion (up to x100) and query processing (up to x7000) speed.

Top-k Range Search on Weighted Interval Data

Daichi Amagata (Osaka University)*; Jimin Lee (The University of Osaka)

Weighted intervals are ubiquitous because many objects are associated with temporal and numeric dimensions. As interval datasets are usually large, efficient management and processing of large weighted interval data are required. This paper addresses the problem of top-k range search on weighted interval data, which retrieves k intervals with the largest weight among a set of intervals overlapping a given query interval. It finds important analytical applications for vehicles, events, and cryptocurrencies. Existing algorithms for range search on interval data are inefficient for this problem, because they need to search for all intervals that overlap a given query interval. To overcome this inefficiency issue, we first provide a baseline algorithm and then propose two data structures and their associated algorithms. Our first proposed algorithm is practically fast but requires O(n log k) time, where 𝑛 is the number of intervals, whereas the other requires less than O(n log k) time. We conduct extensive experiments on real-world datasets, and the results show that our algorithms outperform baseline techniques in most cases.