Cities worldwide are challenged by increasing urban flood risks. Precise and realistic measures are required to reduce flooding impacts. However, currently implemented sewer and topographic models do not provide realistic predictions of local flooding occurrence during heavy rain events. Assessing other factors such as spatially distributed rainfall, socioeconomic characteristics, and social sensing, may help to explain probability and impacts of urban flooding. Several spatial datasets have been recently made available in the Netherlands, including rainfall-related incident reports made by citizens, spatially distributed rain depths, semidistributed socioeconomic information, and buildings age. Inspecting the potential of this data to explain the occurrence of rainfall related incidents has not been done yet. Multivariate analysis tools for describing communities and environmental patterns have been previously developed and used in the field of study of ecology. The objective of this paper is to outline opportunities for these tools to explore urban flooding risks patterns in the mentioned datasets. To that end, a cluster analysis is performed. Results indicate that incidence of rainfall-related impacts is higher in areas characterized by older infrastructure and higher population density.

Cities are vulnerable to rainfall flooding risks; rainfall can affect electrical installations, household contents, road traffic, private and public assets, and business activities (ten Veldhuis et al., 2011; Ashley et al., 2005). Therefore, adaptation measures to better cope with those risks are required. Smart instrumentation and drainage maintenance (e.g. Gaitan et al., 2014; ten Veldhuis and Clemens, 2011), and emergency protocols (e.g. Melo et al., 2015), as well as retrofitting of existing infrastructure and urban redevelopment (e.g. Jacobs, 2012), are examples of such measures. However, their effectiveness depends on the available knowledge of the mechanisms leading to damage after heavy rains, which are still not completely understood (Gaitan et al., 2015; Spekkers et al., 2013, 2014; ten Veldhuis et al., 2011).

Previous studies indicated that currently implemented drainage models do not provide realistic predictions of local flooding occurrence during heavy rain events (Fontanazza et al., 2011; Ochoa-Rodriguez et al., 2014; Gaitan et al., 2015). Studying the spatial distribution of possible explanatory variables may help to explain how a heavy rainfall event triggers occurrence of flooding impacts in a city. Recent works have analyzed whether occurrence of citizens' complaints and insurance claim reports on rainfall-related incidents could be explained by variability in urban topography (Gaitan et al., 2015) and rainfall intensity (Spekkers et al., 2013). Response and explaining variables in those studies were selected a-priori. Their results show that for the considered cases of delta city conditions, pluvial flooding impacts cannot be explained solely by rainfall intensity or urban topography. The flat geography of delta areas may be part of the reason of the low explaining power of meteorological and topographic variables; this emphasizes the importance of exploring additional factors to further model urban flooding risks.

Spekkers et al. (2014) used a decision tree analysis for exploring the extent in which different environmental and socio-economic characteristics explain variances in insurance claim data due to rainfall-related damage. Due mostly to privacy issues, accessing them required explicit agreement with government and private parties. The decision tree explained around 25 % of variance in claim occurrence, improving beyond the 10 to 20 % of explaining power obtained via multiple regression models. Remaining unexplained variance was suggested to be related to a coarse level of spatial aggregation during the analysis and to the possible lack of key explanatory variables, which were not included in the study.

Multiple information sources describing urban characteristics have recently become publicly available under an Open Data policy in the Netherlands (Dutch Ministry of Interior and Kingdom Relations, 2014). This offers the opportunity to explore the potential of such data to provide insights about pluvial flood risks. Due to privacy issues, socioeconomic data in these sources has been aggregated, but their open availability enables flexible use of combinations of datasets, models, and different spatial aggregations, without the need of special permits of private or public parties.

Data sources used in this study. Total number of data points and mean refer only to case study area. Points with secret or not available data have been excluded from this table. Only buildings in use have been considered.

These multivariate spatial datasets can be used to investigate whether certain urban characteristics trigger reports on rainfall-related incidents, which serve as a proxy to measuring flood impacts.

Multivariate spatial data analysis is widely applied in the field of ecology. In the field of community and landscape ecology, for instance, the main matrix of analysis is composed by sampled objects in the rows, and frequency or abundance of different species, or measurement of environmental variables, in the columns. Clustering, ordinations, and multiple regressions are part of the techniques used to analyze such matrices. Cluster analysis is used to classify associations between sampling sites, or between the composition of species inhabiting such sites. Ordination techniques, such as principal coordinates analysis (PCoA), are used to interpret community changes along environmental gradients in complex multivariate datasets. Regression analysis on singles species is also used to quantify possible relationships between species and environmental variables (Jongman et al., 1995).

While those methods are used by ecologists to characterize and model the environmental conditions in which species flourish, these techniques can be applied to open spatial data to explain occurrence of urban flooding impacts. A set of those methods is presented in this paper and their potential for flood risk analysis is discussed. Methods include spatial indexing, classification dendrograms, ordination via multidimensional scaling, and fitting of multiple regressions on the complaints.

This article is organized as follows: Sect. 2 explains the goals and procedures of the exploratory methods, results of the application of one of those methods are discussed in Sect. 3, and Sect. 4 draws conclusions and provides an outlook for the use of the described methods.

In order to explain how the methods presented in this paper can operate in big sets of open spatial data for the purpose of flood risk analysis, a number of publicly available datasets are considered. Datasets are collected for the case of a heavy rain event that hit Amsterdam on 28 July 2014, causing considerable flood damage. The data available includes socioeconomic, cadastral, and meteorological data. Reports about rainfall related incidents are also available, and can be used as indicators of urban flooding incidence. Data-sources and methods are described in the following subsections. An overview of data characteristics is shown in Table 1.

Rainfall intensity measurements are based on a system of two C-band Doppler
weather radars operated by the Royal Netherlands Meteorological Institute
(KNMI, 2013). Rainfall depths, observed at 1.5 km above the ground, are
provided with a temporal resolution of 5 min, on a grid of 1 km

Available socio-economic statistics include information about population
density, age, and income. These datasets are updated every year. They also
include housing characteristics and market-prices for the period 2012–2013.
Statistics are provided on a geospatial vector data format: they are
aggregated into hectare and 0.25 km

The municipality of Amsterdam provides platforms for registering citizen complaints about incidents in the city, including those related to urban flooding impacts. Reports are recorded by personnel from the municipality, and include a unique identifier, a timestamp, address of reported incidents, an initial classification into general types of rainfall-related incidents (e.g. blocked gully, flooded basement, overloaded sewer) and a short description of the reported incident. Municipalities can be requested to provide these reports for scientific research.

Data pre-processing starts with the definition of the spatial units of
analysis or “sample sites”. The selection of the shape and extent of each
site was done according to three purposes: having a comparable area in every
unit, facilitating geometric calculations and data re-arrangement, and
keeping a balance between fine spatial resolution and availability of data.
Using the grid of the radar data satisfies the stated purposes. Grid cells
are simple geometries that do not demand too complex computations during
spatial queries. As the Amsterdam area is around 200 km

A second step in the pre-processing is to perform spatial queries and averages in each of the variables listed in Table 1 to determine their mean status in each sampling unit. The queries are used to intersect realizations of the variables with their underlying cells. The average value of the intersected realizations is assigned to each cell. Given the size of handled data, the implementation of a spatial index is required to reduce computational demands. As it greatly reduces the number of intersections that need to be made during the sampling of variables realizations at different sites, R-Tree indexing (Guttman, 1984) was applied to the sample sites in this study. Finally, the resulting realization averages per cell are stored in a matrix in which rows represent sampling sites and columns each of the variables.

Before describing the rest of the techniques presented by this paper, it is worthy to make some considerations about the usage of the variables described in Sects. 2.1 to 2.3. Reports of rainfall-related incidents are taken as a proxy to urban flooding impacts. For this reason the occurrence of those reports are considered the response variable in this study. The rest of the variables are assessed as explanatory variables.

Sampled sites can be grouped in terms of their similarities using a cluster analysis. Groups produced by this type of analysis are characterized by the similarity of values of the different variables. The goal of this classification is to discretize areas that share urban conditions, and to explore whether some of the obtained groups are more prone to rainfall-related impacts. The grouping is done in the matrix obtained according to Sect. 2.4, excluding the column of reports from the matrix, and considering only the sites with observed reports of incidents.

Cluster analysis can be done via a hierarchical grouping that maximizes similarities within groups, and differences between groups. The result of a hierarchical grouping applied to the sites is a tree-like structure, or dendrogram, displaying how close sites are to each other in terms of the values of the multiple variables under analysis (Jongman et al., 1995; Legendre and Legendre, 2012).

Clustering requires the computation of a square similarity matrix indicating
how close are sites to each other in terms of all the considered variables.
Selecting a distance metric depends on the nature of data. For instance,
non-parametric Spearman's

The clustering process can be made by setting weights on the objects being
classified, and by establishing a priori similarity thresholds to define groups.
Without previous knowledge about differences on the importance of sampling
units or variables, the unweighted pair group method with arithmetic average
(UPGMA) is preferred over the weighted pair group (WPGMA) or

For the case of flood risk analysis, a likely outcome of a cluster analysis applied to different urban conditions would group similar sites into branches sharing urban characteristics such as population density, income, or infrastructure age. If such grouping also happens to differentiate groups of sites with remarkably different report incidence, which is not included in the clustering computation, then the environmental configurations with higher urban flooding risks can be discriminated. An expected result is that groups characterized by sites with higher urban densities, and thus bigger number of inhabitants, display higher incidence of reports (Gaitan et al., 2015). Even though the focus of this paper is to present an overview of methods to explore multivariate spatial datasets, results of the clustering test are discussed in Sect. 3.

Cluster analysis of sampled sites. Groping is made via UPGMA.

Apart from the clustering of sample sites, patterns in the distribution of contextual variables can be projected on a Cartesian space of reduced dimensionality using Principal Coordinates Analysis (PCoA). This allows us to observe which variables explain most of the data variation and which ones are redundant. The representation produced by a PCoA is similar to the more well-known Principal Components Analysis (PCA).

The two methods differ in the way in which synthetic axes are calculated. In PCA, the computation of eigenvectors (i.e. the Components) is done assuming that Euclidean distances can be calculated directly from input data, and that analyzed variables correlate linearly. In a set of data having types different than quantitative (e.g., ranked or frequency data), PCA cannot be directly applied. PCoA, on the other hand, can be computed using different distance indexes (see indices described in previous section). Those distances are projected in Cartesian planes, and the resulting distance between all pairs of projected points is used to compute eigenvalues and eigenvectors (Legendre and Legendre, 2012; Jongman et al., 1995).

PCoA can visualize whether the incidence of rainfall-related impacts follows one of the theoretical environmental gradients proposed by the Coordinates. If so, the next step is to look if any of the analyzed variables is mostly aligned along that gradient. For example, one could expect that rainfall intensity aligns with one of the Principal Coordinates. One might also expect that points of average population density and building market prices gradients are displayed close to each other, suggesting some collinearity and redundancy. The distance index used to calculate the PCoA must always be considered when interpreting the resulting eigenvectors: the relationship between two variables of mixed data type might imply that the relations are not as simple as a correlation between two quantitative variables.

Multiple regression explores relations between a response variable and several explanatory variables. Among the methods presented in this paper, it is probably the most widely known. Multiple regression aims at describing possible links between explanatory and response variables, with a certain margin of error: the method can be used for estimating parameters ranks in which a response variable is more likely to occur. It also indicates the contribution of different explanatory variables to the variance of the response variable.

Special attention must be given to the type of data being analyzed as the method varies depending on it. Quantitative data can be modeled with straight lines, parabolas and Gaussian curves, using least-squares regressions and analysis of variance (ANOVA). Regressions on presence-absence data are modelled with logit models, using chi-tests, maximum-likelihood principle, and Gaussian logit (Jongman et al., 1995).

Results of multiple regression analysis should present some agreement with ordination analysis: the variables explaining most of the response variability in multiple regression (e.g. rainfall intensity) should be the ones most closely aligned to the main eigenvectors. Disagreements can arise, though, depending on the distance indexes used in the PCoA. The total variance explained by the significant parameters modeled in the regression determines how much variability it is explaining, indicating whether additional parameters need to be explored. In other words, multiple regression checks for useful variables, and prompts for further exploration of variables if not enough response variance is explained.

After performing the aggregations described in Sect. 2.4, rainfall-related
incidents were observed in a total of 81 sampling units. The cluster
analysis was applied to these 81 sites. A first view of the resulting
dendrogram (see Fig. 1) reveals the structure of similarities between
analyzed sites. Being above 95 % of similarity, groupings at the leaves
indicate that values of variables do not change drastically from one km

If the dendrogram is cut at 85 % of similarities (see red, dashed
horizontal line in Fig. 1), seven groups are differentiable. Viewing them
from left to right, third and fourth group (enclosed in black boxes) present
a particularly high incidence of rainfall-related reports. The group in the
right box is the one including eight sites from the city center. On the
other hand, the group in the left box includes 24 sites from diverse city
locations. The city center group is highly differentiated from the rest, and
accounts for 80 incident reports: with 10 % of the sites it accounts for
24 % of the rainfall-related incidents. The left group box includes 127
reports which equal 37 % of incidents. The incidence of rainfall-related
impacts is aprox. 5 km

When the average conditions in the three mentioned groups are compared, it
is clear that the building age in the city center group (

This paper presented open spatial data sources and multivariate exploratory methods that are available for research on urban flooding risks. These methods consisted of spatial indexing, clusters analysis, PCoA, and multiple regression. Spatial aggregation and indexing of data allow us to compile a matrix of sites and variables, in which the heterogeneous original information, which has references about geographic locations, can be rearranged for analysis. Cluster analysis using UPGMA provides a characterization of urban areas under study. PCoA reduces the dimensionality of multiple, heterogeneous datasets, suggesting the presence of environmental gradients. It also indicates which of the variables from the available explanatory variables, explains most of data variability. Multiple regression fits models to the explanatory variables to explain response variables, providing measures of the amount of explained power. The application of a cluster analysis to the available data indicated that the incidence of rainfall-related incidents is higher in areas characterized by older infrastructure and higher population density. Such information is useful for designing and implementing proper adaptation measures against urban flooding. Results from cluster analysis can be complemented by applying a PCoA and a multivariate analysis to the data, which will be the matter of future research.

We kindly thank the city of Amsterdam, for providing their complaint register for the purpose of this study, and Climate-KIC, for providing financial support for this research.