Spatio-temporal prediction aims to forecast and gain insight into the ever-changing dynamics of urban environments across both time and space. Its purpose is to anticipate future patterns, trends, and events in diverse facets of urban life, including transportation, population movement, and crime rates. Although numerous efforts have been devoted to developing neural network techniques for accurate spatio-temporal prediction, many of these methods depend heavily on sufficient labeled data to learn precise spatio-temporal representations. Unfortunately, data scarcity is pervasive in practical urban sensing scenarios, and in certain cases it is difficult to collect any labeled data from downstream scenarios at all, which intensifies the problem further. It is therefore necessary to build a spatio-temporal model with strong generalization capabilities across diverse spatio-temporal learning scenarios. Inspired by the remarkable achievements of large language models (LLMs), our objective is to create a spatio-temporal LLM that generalizes well across a wide range of downstream urban tasks. To achieve this objective, we present UrbanGPT, which seamlessly integrates a spatio-temporal dependency encoder with the instruction-tuning paradigm. This integration enables LLMs to comprehend the complex inter-dependencies across time and space, facilitating more comprehensive and accurate predictions under data scarcity. To validate the effectiveness of our approach, we conduct extensive experiments on various public datasets covering different spatio-temporal prediction tasks. The results consistently demonstrate that UrbanGPT, with its carefully designed architecture, outperforms state-of-the-art baselines. These findings highlight the potential of building large language models for spatio-temporal learning, particularly in zero-shot scenarios where labeled data is scarce.
Figure 1: The overall architecture of the proposed spatio-temporal language model UrbanGPT.
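To make the integration of the spatio-temporal dependency encoder with the instruction-tuning paradigm (Figure 1) more concrete, the sketch below is a minimal illustration rather than the released implementation; the module and parameter names (`STEncoder`, `SpatioTemporalLLM`, `d_model`, `d_llm`) are our own, and it assumes a Hugging Face-style causal LM that accepts `inputs_embeds`. It shows one way spatio-temporal histories could be encoded into token-like embeddings, projected into the LLM embedding space, and prepended to the embedded instruction.

```python
# Minimal sketch (assumed names/shapes): a temporal-convolution encoder whose
# outputs are projected into the LLM token-embedding space and prepended to
# the tokenized instruction, so the (frozen) LLM can attend over both.
import torch
import torch.nn as nn

class STEncoder(nn.Module):
    """Encodes a [batch, time_steps, num_regions] history into one embedding per region."""
    def __init__(self, in_steps: int, d_model: int):
        super().__init__()
        # 1x1 convolution over the temporal dimension; treats time steps as channels.
        self.temporal_conv = nn.Conv1d(in_channels=in_steps, out_channels=d_model, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, time_steps, num_regions] -> [batch, num_regions, d_model]
        return self.temporal_conv(x).transpose(1, 2)

class SpatioTemporalLLM(nn.Module):
    """Projects spatio-temporal embeddings into the LLM space and prepends them
    to the embedded instruction before the (frozen) language model."""
    def __init__(self, llm: nn.Module, in_steps: int, d_model: int, d_llm: int):
        super().__init__()
        self.encoder = STEncoder(in_steps, d_model)
        self.projector = nn.Linear(d_model, d_llm)  # align with the LLM hidden size
        self.llm = llm  # assumed: a causal LM that accepts `inputs_embeds`

    def forward(self, st_history: torch.Tensor, instruction_embeds: torch.Tensor):
        st_tokens = self.projector(self.encoder(st_history))        # [batch, regions, d_llm]
        inputs = torch.cat([st_tokens, instruction_embeds], dim=1)  # prepend ST "tokens"
        return self.llm(inputs_embeds=inputs)
```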
Figure 2: Illustration of spatio-temporal prompt instructions encoding the time- and location-aware information.
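As a purely hypothetical example of the kind of time- and location-aware instruction depicted in Figure 2, a prompt can combine recent readings with temporal and spatial context. The field names and wording below are illustrative assumptions, not the exact template used in the paper.

```python
# Hypothetical prompt template (illustrative only; the actual instruction
# format used by UrbanGPT may differ in wording and fields).
def build_prompt(region_name: str, weekday: str, date: str, hour: int, history: list[int]) -> str:
    return (
        f"Given the historical taxi flow recorded in region '{region_name}' "
        f"over the past {len(history)} hours ending at {hour}:00 on {weekday}, {date}: "
        f"{history}. "
        "Considering the time of day, day of week, and the region's urban function, "
        "predict the taxi flow for the next time step."
    )

print(build_prompt("Lower Manhattan", "Friday", "2020-01-10", 18, [32, 41, 55, 60]))
```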
In this section, we thoroughly evaluate the predictive performance of our proposed model in zero-shot scenarios. Our objective is to assess the model's effectiveness in predicting spatio-temporal patterns in geographical areas that it has not encountered during training. This evaluation encompasses both cross-region and cross-city settings, allowing us to gain insights into the model's generalization capabilities across different locations.
Cross-region scenarios use data from certain regions within a city to forecast future conditions in other regions that the model has not encountered. The results highlight the exceptional performance of our proposed model on both regression and classification tasks across various datasets, surpassing the baseline models in zero-shot prediction.
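For illustration, a cross-region zero-shot split of this kind can be sketched as follows; the split ratio and the random region assignment are assumptions for exposition, not the paper's exact protocol.

```python
# Sketch of a cross-region zero-shot split: train on one subset of regions,
# evaluate on regions never seen during training (ratios are illustrative).
import numpy as np

def cross_region_split(data: np.ndarray, train_ratio: float = 0.7, seed: int = 0):
    """data: [time_steps, num_regions]; returns (seen-region data, unseen-region data)."""
    rng = np.random.default_rng(seed)
    regions = rng.permutation(data.shape[1])
    cut = int(train_ratio * len(regions))
    seen, unseen = regions[:cut], regions[cut:]
    return data[:, seen], data[:, unseen]
```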
Figure 3: Our model’s performance in zero-shot prediction is evaluated on three diverse datasets: NYC-taxi, NYC-bike, and NYC-crime, providing a comprehensive assessment of its predictive capabilities in unseen situations.
To assess the performance of our model on cross-city prediction tasks, we conduct tests on the CHI-taxi dataset, which is not seen during the training phase.
Figure 4: Time step-based prediction comparison experiment conducted on the CHI-taxi dataset.
This section examines the predictive capabilities of our UrbanGPT in end-to-end supervised prediction scenarios, as presented in Figure 5.
Figure 5: Evaluation of performance in the end-to-end supervised setting on the NYC-taxi and NYC-bike datasets.
This section investigates the impact of key components on the performance of our model, as illustrated in Figure 6. The ablation experiments primarily focus on the zero-shot scenario using the NYC-taxi dataset.
Figure 6: Ablation study of our proposed UrbanGPT.
In this section, we evaluate the robustness of our UrbanGPT across different spatio-temporal pattern scenarios. We categorize regions according to the variance of their numerical readings (e.g., taxi flow) over a specific time period. Lower variance indicates stable temporal patterns, while higher variance suggests diverse spatio-temporal patterns, as found in active commercial zones or densely populated areas.
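A minimal sketch of such a variance-based categorization is given below; the use of quantile thresholds and the number of groups are illustrative assumptions rather than the paper's exact criterion.

```python
# Minimal sketch: group regions by the variance of their time series
# (quantile thresholds and group count are illustrative assumptions).
import numpy as np

def categorize_regions_by_variance(flows: np.ndarray, n_groups: int = 3) -> np.ndarray:
    """flows: [time_steps, num_regions]; returns a group index per region,
    where 0 = most stable temporal patterns and n_groups-1 = most variable."""
    variances = flows.var(axis=0)
    thresholds = np.quantile(variances, np.linspace(0, 1, n_groups + 1)[1:-1])
    return np.digitize(variances, thresholds)
```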
Figure 7: Robustness study of the UrbanGPT model.