Coupling physically based and data-driven models for assessing freshwater inflow into the Small Aral Sea

The desiccation of the Aral Sea and the related changes in regional hydroclimatic conditions have been a hot topic for the past decades. The key problem of research devoted to the modern hydrological regime of the Aral Sea basin is its fragmented nature: only a limited number of papers consider the complex runoff formation system as a whole. To address this challenge, we have developed a continuous prediction system for assessing freshwater inflow into the Small Aral Sea, based on coupling a stack of hydrological and data-driven models. Results show good prediction skill and confirm that it is possible to develop a valuable water assessment tool that utilizes the power of both classical physically based and modern machine learning models for territories with complex water management systems and strong water-related data scarcity. The source code and data of the proposed system are available on a GitHub page (https://github.com/SMASHIproject/IWRM2018).


Introduction
The Aral Sea and its basin are among the most recognizable examples of significant environmental change that has taken place in Central Asia during the last decades (Izhitskiy et al., 2016; Micklin, 2007; Raskin et al., 1992; Zavialov et al., 2003). As a result of river runoff exploitation by huge irrigation systems, the Aral Sea level has decreased significantly, which has caused irreversible shifts in the ecosystem and water balance (Zmijewski and Becker, 2014). Nowadays the Small Aral Sea has only a limited hydrological connection (through the Kokaral Dam) with the dying southern sea basins and tends to remain a separate water body under the current social and political situation in the region. It is extremely important to devote scientific attention to this region as a real-life example of human-induced impact on water balance and its response (Immerzeel and Bierkens, 2012).
The main volume of the freshwater inflow into the Small Aral Sea is formed in the Syr Darya river basin, which is among the largest and most vulnerable river basins in Central Asia. There are thirteen large reservoirs and many local water management installations on the Syr Darya river and its tributaries, which utilize the full freshwater potential for irrigation, industrial, recreational, and social needs. This complex structure of the water management system, coupled with the total absence of data describing its functioning, is a challenge for any approach directed to the accurate assessment of the Small Aral Sea freshwater budget formation and evolution across the basin (Lutz et al., 2012a; Raskin et al., 1992; Sorg et al., 2014).
There are three main categories of scientific literature devoted to the identification of modern water balance shifts in the Aral basin. The first group comprises research on large-scale changes in heat and water fluxes based on remote sensing, climate modeling, and reanalysis data for the whole basin area (López et al., 2017; Shi et al., 2014; Zmijewski and Becker, 2014). These investigations help to identify the patterns and key factors that affect long-term hydrological changes and their trends (geographical approach), but cannot be easily scaled to provide quantitative predictions. The second group focuses mostly on the upstream area of the Aral Sea basin (the mountainous zone and the Ferghana Valley) because of the presence of high-altitude glaciers and the biggest reservoirs there. Pereira-Cardenal et al. (2011), Siegfried et al. (2012), Hagg et al. (2006, 2007), Gan et al. (2015), and Lutz et al. (2012b) modeled runoff in glacierized catchments and its contribution to the inflow of downstream reservoirs using conceptual and physically based models (NAM, HBV-ETH, OEZ, SWAT, AralMountain). Apel et al. (2017) evaluated the skill of simple statistical models for seasonal runoff forecasts in this region. Radchenko et al. (2017) examined historical runoff for 18 river basins in the Ferghana Valley using the HBV-light model and estimated projected changes in streamflow characteristics for these basins according to the A1B climate scenario. For an extensive review of hydrological modeling studies in glacierized catchments of Central Asia, please refer to Chen et al. (2017). The third (and the smallest) group of papers is devoted to developing an end-to-end hydrological modeling system for the whole Small Aral Sea basin. A simplified approach for assessing annual freshwater inflow, based on hypothetical and general circulation model based scenarios of future temperature and precipitation, was applied in Shibuo et al. (2007) and Jarsjö et al. (2012) using the Porflow model without any parameter calibration. The most comprehensive routine for model-based assessment of the water balance components of the Syr Darya river basin was proposed in Lutz et al. (2012a); it couples the conceptual runoff formation model AralMountain (Lutz et al., 2012b) with the Water Evaluation And Planning model (WEAP), which had already been implemented for the former Aral Sea basin in 1989 (Raskin et al., 1992).
Published by Copernicus Publications on behalf of the International Association of Hydrological Sciences.
In the presented work we have tried to combine best practices from the existing scientific literature with modern advances in machine learning to develop a continuous hybrid hydrological model. It describes runoff generation processes using physically based models, and runoff transformation through one of the most complex water management systems in the world using machine learning models. With our research, we want to fill the current gap in developing a continuous runoff prediction system for the entire Syr Darya river basin using a combination of state-of-the-art modeling techniques. Our research does not claim to cover the problem of freshwater inflow prediction for the Small Aral Sea in full detail; rather, it is an attempt to map the efficiency of a runoff prediction system built only on open data sources.

Study area
The main part of the Small Aral Sea basin (Fig. 1) is occupied by the Syr Darya river and its tributaries, which contribute around 40 km³ of freshwater inflow annually (Radchenko et al., 2017). About 70 % of the runoff of the Syr Darya river basin originates in the mountain ranges of Kyrgyzstan, and the main part of this volume comes from the Ferghana Valley river basins (Belyaev, 1995; Radchenko et al., 2017). In our research, we have selected 24 basins that drain into the Ferghana Valley as the main source of hydrological insight and information about runoff generation in the freshwater formation zone of the Small Aral Sea (Fig. 1). These basins are highly contrasting in geographical and hydroclimatic conditions and cover a range of areas from 150 to 24 000 km². For a detailed geographical description of the Ferghana Valley river basins, please refer to Radchenko et al. (2017).

Runoff and meteorological forcing data
Observed runoff data for the selected basins were provided by the Global Runoff Data Centre (GRDC; http://www.bafg.de/GRDC/). Daily observed runoff time series were available for only 2 of the 24 basins; therefore, we used only monthly observations to maintain methodological consistency. Runoff data availability is the main limitation for developing and validating our methodology, because the majority of available observations lie in the interval from 1975 to 1985. For modern studies of contemporary water resources over vast territories, it is essential to use global gridded data products as the only spatially and temporally continuous source. For this reason, all models were driven by precipitation and temperature data from the ERA-40 reanalysis (1957-2002, 0.5° spatial resolution, http://apps.ecmwf.int/datasets/, Uppala et al., 2005). Potential evapotranspiration is another required forcing variable for all models; it was derived using the temperature-based equation of Oudin et al. (2005).
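As an illustration of the temperature-based potential evapotranspiration estimate mentioned above, the following Python sketch implements the Oudin et al. (2005) formulation, PE = Re / (λρ) · (T + 5) / 100 for T + 5 > 0 and PE = 0 otherwise. The extraterrestrial radiation Re is computed here with the standard FAO-56 expressions, which is an assumption about the implementation and not a description of the code used in this study; the function names are hypothetical.

```python
import math


def extraterrestrial_radiation(lat_deg, day_of_year):
    """Daily extraterrestrial radiation in MJ m-2 day-1 (FAO-56 expressions, assumed here)."""
    phi = math.radians(lat_deg)
    # Inverse relative Earth-Sun distance and solar declination
    dr = 1.0 + 0.033 * math.cos(2.0 * math.pi * day_of_year / 365.0)
    delta = 0.409 * math.sin(2.0 * math.pi * day_of_year / 365.0 - 1.39)
    # Clip the argument so that polar day or night does not break acos
    cos_ws = max(-1.0, min(1.0, -math.tan(phi) * math.tan(delta)))
    ws = math.acos(cos_ws)  # sunset hour angle
    gsc = 0.0820  # solar constant, MJ m-2 min-1
    return (24.0 * 60.0 / math.pi) * gsc * dr * (
        ws * math.sin(phi) * math.sin(delta)
        + math.cos(phi) * math.cos(delta) * math.sin(ws)
    )


def oudin_pet(temp_c, lat_deg, day_of_year):
    """Potential evapotranspiration in mm/day from mean daily air temperature."""
    if temp_c + 5.0 <= 0.0:
        return 0.0
    ra = extraterrestrial_radiation(lat_deg, day_of_year)
    latent_heat = 2.45  # MJ kg-1
    # Ra / (lambda * rho) is in m/day for rho = 1000 kg m-3;
    # converting to mm/day multiplies by 1000, so the two factors cancel.
    return ra / latent_heat * (temp_c + 5.0) / 100.0
```

For example, `oudin_pet(20.0, 41.0, 180)` gives roughly 4 mm/day for a warm summer day at a latitude typical of the Ferghana Valley, while any day with a mean temperature below -5 °C gives zero.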

Hydrological models
The HBV (Hydrologiska Byråns Vattenbalansavdelning; Lindström et al., 1997), GR4J (modèle du Génie Rural à 4 paramètres Journalier; Perrin et al., 2003), and SIMHYD (Chiew et al., 2009) models were used in this study because of their wide use in different hydrological applications, flexibility, proven effectiveness for runoff prediction in different geographical conditions, and numerous successful applications in studies of prediction in ungauged basins (Beck et al., 2016; Oudin et al., 2008; Reichl et al., 2009). All listed models are typical conceptual, bucket-type models with lumped parameters that represent runoff formation processes at the basin scale with a daily time step. The GR4J and SIMHYD models were extended with the Cema-Neige snow module (Valéry et al., 2014). The source code of the models is freely available as a component of the LHMP tool (Lumped Hydrological Models Playground, http://github.com/hydrogo/LHMP, Ayzel, 2016). Model parameters were automatically calibrated by maximizing the Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970) over the whole period of observations using a differential evolution algorithm.
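The calibration criterion can be written down compactly. The sketch below computes NSE = 1 - Σ(obs - sim)² / Σ(obs - mean(obs))² and wraps it in a loss that a minimizer could use. The names `calibration_loss` and `toy_model` are hypothetical and only stand in for a real hydrological model; they are not part of LHMP.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0.0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - residual / variance


def calibration_loss(params, run_model, forcing, observed):
    """Loss to minimize: maximizing NSE is the same as minimizing 1 - NSE."""
    simulated = run_model(params, forcing)
    return 1.0 - nse(observed, simulated)


def toy_model(params, forcing):
    """Hypothetical one-parameter model: runoff as a fixed fraction of precipitation."""
    (runoff_coefficient,) = params
    return [runoff_coefficient * p for p in forcing]
```

A loss of this form could be passed, together with parameter bounds, to an optimizer such as `scipy.optimize.differential_evolution`; whether the study used that particular implementation is not stated in the text.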

Machine learning models
For runoff modeling, we used a range of machine learning models, from simple MLR (Multiple Linear Regression) and wide-based decision tree ensembles of ETR (Extra Trees Regression) to the most complicated depth-based tree ensembles of LGB (Light Gradient Boosting machine) and XGB (eXtreme Gradient Boosting machine). MLR and ETR were implemented using the Scikit-learn package (https://github.com/scikit-learn/scikit-learn, Pedregosa et al., 2011), LGB was taken from the LightGBM package (https://github.com/Microsoft/LightGBM, Zhang et al., 2017), and XGB was taken from the XGBoost package (https://github.com/dmlc/xgboost, Chen and Guestrin, 2016). Tuning machine learning model parameters requires a lot of expertise and experimentation and is hard to automate because of its high computational cost (Snoek et al., 2012); therefore, we calibrated the required parameters manually. To derive predictions in an ensemble manner and to make the model setting more realistic, we used the leave-one-out cross-validation technique for assessing machine learning model performance (Ayzel, 2017; Hastie et al., 2001). As a result, we evaluated model performance on every observational point independently and produced an ensemble of realizations whose size corresponds to the number of runoff observations used. This setting provides the most comprehensive evaluation protocol for machine learning models and for the uncertainties related to model structure.
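The leave-one-out protocol can be sketched in a few lines. The example below uses a one-predictor least-squares line as a stand-in for the regression models of this study, purely to keep the sketch dependency-free; the function names are hypothetical. Each observation is held out once, a model is fitted on the remaining points, and the fitted models together form the ensemble.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # fails if all training x values are identical
    return slope, mean_y - slope * mean_x


def leave_one_out(x, y):
    """Hold out each point once; return held-out predictions and the fitted members."""
    predictions, members = [], []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        members.append((slope, intercept))
        # The prediction for point i comes from a model that never saw it
        predictions.append(slope * x[i] + intercept)
    return predictions, members
```

With n observations this yields n fitted models, so the ensemble size equals the number of runoff observations, as described above. The held-out predictions can then be scored against the observations with any efficiency measure.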

Feature engineering
Feature engineering is an essential part of any routine for developing a machine learning model. The general idea of feature engineering is to map the existing features of the data to a new representation (dimension). Two classical implementations (among others) are extending the data with features shifted in time (hereafter referred to as LAGS) and reducing data dimensionality with the principal component analysis (PCA) orthogonal transformation (Hastie et al., 2001). We tested the performance of our machine learning models with the default input features, with LAGS and PCA applied separately and in combination, and then selected the best combination in terms of runoff prediction accuracy.
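A minimal sketch of the LAGS transformation is given below, assuming the input is a time-ordered list of feature rows (for example, one row of monthly forcing values per time step). The function name `add_lags` and the number of lags are illustrative choices, not values taken from this study.

```python
def add_lags(rows, n_lags):
    """Extend each time step with the features of the previous n_lags steps.

    rows is a time-ordered list of feature lists. The first n_lags steps are
    dropped because their history is incomplete.
    """
    if n_lags < 0:
        raise ValueError("n_lags must be non-negative")
    lagged = []
    for t in range(n_lags, len(rows)):
        extended = list(rows[t])  # features of the current step
        for lag in range(1, n_lags + 1):
            extended.extend(rows[t - lag])  # features shifted by `lag` steps
        lagged.append(extended)
    return lagged
```

The target series has to be trimmed in the same way (`targets[n_lags:]`) so that features and observations stay aligned in time.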

Workflow representation
The main idea of the presented work was to extract value from all freely available hydrological information for the Small Aral Sea basin. At the first stage of our research workflow (Fig. 2), we calibrated the parameters of three hydrological models for the 24 rivers that drain into the Ferghana Valley. During the calibration stage, every model was run at daily temporal resolution, and the predicted runoff was then aggregated to the monthly scale for consistency with the observational data when calculating the loss function. At the second stage, we implemented a common spatial proximity based technique of model parameter regionalization (Oudin et al., 2008) to transfer the optimal sets of model parameters to the centroids of the meteorological forcing grid cells. At the third stage, we ran our models in a grid-cell-wise mode to compute runoff in every grid cell of the previously delineated formation zone (Fig. 1). As a result, we developed a daily gridded multi-model runoff database for the Small Aral Sea runoff formation zone, which serves as an additional input data source for runoff modeling using machine learning models. For the first gauge in our cascade on the Syr Darya river (Kal), we used both the gridded meteorological forcing and the formation zone runoff as input. The same inputs were used for the next gauge in the cascade (Bekabad), with the mean ensemble modeled runoff realization from Kal added. For the remaining two gauges in the cascade (Tyumen Aryk and Kazalinsk), we used only the meteorological forcing and the mean ensemble modeled runoff realization from the upstream gauge in the cascade.
Proc. IAHS, 379, 151-158, 2018 (proc-iahs.net/379/151/2018/)
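The second stage of the workflow can be illustrated with a nearest-donor sketch of spatial proximity regionalization: each grid cell centroid receives the calibrated parameter set of the closest gauged basin. The donor coordinates, the parameter dictionaries, and the use of a single nearest donor are assumptions made for this example; the text does not specify which point represents a basin or how many donors were used.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2.0) ** 2)
    return 2.0 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_donor_parameters(cell_lat, cell_lon, donors):
    """Return the parameter set of the donor basin closest to the grid cell centroid.

    donors is a list of dicts with keys 'lat', 'lon' and 'params' (hypothetical layout).
    """
    closest = min(
        donors,
        key=lambda d: haversine_km(cell_lat, cell_lon, d["lat"], d["lon"]),
    )
    return closest["params"]
```

Applying `nearest_donor_parameters` to every centroid of the forcing grid gives one parameter set per cell, after which the model can be run cell by cell as described for the third stage.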

Results and discussion
Model calibration results differ from model to model and with the complexity of the optimization algorithm setting. Only one setting, the HBV model with the most computationally expensive realization of the differential evolution algorithm (25 iterations), provides positive NSE values for every single basin (Fig. 3), and we decided to use only this set-up for further investigation. Only five of the selected basins have an NSE of less than 0.45; all of them (GRDC ids: 2916590, 2916660, 2916665, 2916670, 2916810) are located on the northern slope of the Alay range. These low efficiencies can be explained by errors in the GRDC observational runoff data or errors in the basin metadata (wrong outlet coordinates or basin areas), which are quite hard to detect and check against the open literature and web sources. Comparison of the obtained modeling results with other studies (Lutz et al., 2012b; Pereira-Cardenal et al., 2011; Radchenko et al., 2017; Siegfried et al., 2012) shows high consistency among different approaches for modeling water balance in the upper, mountainous part of the Syr Darya river. We have therefore shown the value of freely available data sources for water balance modeling in the study area and tried to transfer this value to the gridded runoff database of the Small Aral Sea runoff formation zone using the most robust method of model parameter regionalization (Ayzel et al., 2017). Runoff realizations for the 24 selected basins extracted from the developed gridded runoff dataset in a semi-distributed manner show good consistency with the realizations produced by the lumped model setting with optimal parameters.
There is no silver bullet in the machine learning field for deciding a priori on the best data preprocessing routine, the best model, the best validation technique, or the best measure of failure (or success) of the proposed approach. In our research, we have tried to investigate the most widespread solutions for tackling regression problems in machine learning using different state-of-the-art techniques. Results (Table 1) show good efficiency of different machine learning models in predicting monthly runoff along the cascade of gauges on the Syr Darya river. The high variance in model efficiency from gauge to gauge is explained by the varying complexity of the water management infrastructure and runoff formation systems located between the gauges. The worst results for the Kal and Kazalinsk gauges and the best efficiency for the Tyumen Aryk gauge correspond closely to the complexity of the runoff formation and transformation processes that we want to map with our input data. The inflow from the upstream Bekabad gauge alone is enough for a robust mapping to runoff at Tyumen Aryk using simple linear regression. Ensemble runoff predictions produced by machine learning models (Fig. 4) show a significant level of model-related uncertainty, which correlates strongly with model complexity. This supports the statement "the simpler, the better" regarding the robustness of scientific models, but we have to mention that high prediction uncertainty is a fair price for the ability of a complex model to map input features to the relevant output. It is also clear that the obtained efficiency correlates with the overall complexity of the observed system. For the Kal gauge station (Fig. 4a), the wide range of upstream tributaries and the water management rules applied to them contribute significantly to the complexity of the processes we have to consider; the same is true for the gauge station in Kazalinsk, which is affected by many, often fuzzy and unclear, water management practices (Fig. 4d). This result is also confirmed by the complexity of the preprocessing routine: for the simplest cases (the Bekabad and Tyumen Aryk gauge stations) we do not need to apply either PCA or LAGS to map features to a different dimension.
The Kazalinsk gauge station is about 100 km from the actual Syr Darya delta, and there are many channels, ponds, and other water management infrastructure units (e.g. the Aklak water regulation station) that can affect the total freshwater inflow into the Small Aral Sea. For consistency with previous studies (Lutz et al., 2012a; Raskin et al., 1992), however, we assume that the observed runoff at Kazalinsk equals the freshwater inflow to the sea. Even a brief look at the observed runoff time series at Kazalinsk (Fig. 4d) gives a clear picture of the high complexity of the runoff formation system here: we can detect only a simple seasonal pattern with maximum water availability during winter, and the remaining runoff variability cannot be attributed to natural causes. Nevertheless, the XGB and ETR models handle this complexity well owing to their algorithmic structure, which is based on simple binary decision rules that mimic the decision-making process taking place in many real-life situations. Although observations tend to cluster near the lower and upper boundaries of our prediction interval, which may indicate unstable system behavior, there is an obvious correlation between observed and modeled runoff.
Limited availability of observed runoff data for this region (mainly for 1975-1985) was the main constraint on implementing comprehensive validation routines for the proposed methodology. Nevertheless, the obtained machine learning model-based ensemble realizations of freshwater inflow into the Small Aral Sea for the period 1958-2002 (in line with the forcing data availability) could form the basis for further "Soviet-driven water management" scenario predictions, which would help us to better understand modern shifts in water resources distribution in the post-Soviet period.

Conclusions
The complex structure of the Small Aral Sea basin water management system, coupled with the total absence of data describing its functioning, is a challenge for any approach directed to the accurate assessment of freshwater budget formation and evolution across the basin. Our work shows that these challenges can be tackled by coupling hydrological models with state-of-the-art machine learning techniques. In particular, we have demonstrated the significant value of using physically based models for runoff prediction in the ungauged upper part of the Syr Darya river to develop a gridded runoff database, which can be used as an additional feature for a machine learning model in a coupled setting. Results show positive skill and high flexibility of the proposed methodology, and in our view it can be used widely as a baseline approach for water balance studies in arid, ungauged areas with complex water management systems and strong water-related data scarcity.
We understand that equating freshwater inflow into the Small Aral Sea with observed runoff at Kazalinsk is quite a rough assumption, and in further studies we will try to assess the real inflow by coupling a simple sea water balance model to our existing modeling system.
The code and data we have developed are fully open and freely accessible. We hope that this supports the reproducibility of our research and provides easy access for the community to test, criticize, or apply our findings.
[...] frame of the SMASHI project (http://smashiproject.github.io) and was funded by the Russian Foundation for Basic Research (RFBR), project 17-05-01175 A. The part of the presented study related to developing, adapting, and implementing the conceptual hydrological model was financially supported by the Russian Science Foundation (grant number 16-17-10039). The Global Runoff Data Centre (GRDC) is gratefully acknowledged for providing observed runoff data.

Figure 1. The Small Aral Sea basin and selected river basins.

Figure 3. Boxplot of NSE for formation zone basins.

Figure 4. Predictions of machine learning model ensembles.