Healthy Cities, A comprehensive dataset for environmental determinants of health in England cities

Environmental determinants of health refer to regional, national, and local environmental factors, whether physical, chemical, or biological, that influence human health, together with all related behaviours. To ensure comprehensive coverage of these factors, we select basic, behavioural, built, and natural environment descriptors (see Fig. 1 for details). Generating the target dataset requires collecting, processing, and aggregating heterogeneous data, which transforms the input data sources in Table 1 into the unified format illustrated in Fig. 2. We first describe how the geographical units of the target dataset are determined, then discuss the generation process for each subsection of the dataset shown in Fig. 1.

Table 1 Information of input datasets.
Fig. 2 Example of data records in an MSOA of Birmingham city. The colour represents the data category each record belongs to. For time series data, we showcase the first values.

Determining the geographical units

We select the cities of interest according to the honour list of city status published by the UK government28 and the Office for National Statistics (ONS) Geography definition of major towns and cities29, which together capture high status from both the cultural and economic perspectives. We then keep only the cities that hold administrative power as lower tier local authorities (LTLAs). Combining these criteria yields 29 representative cities in England (see Table 2 for details).

Table 2 Cities of interest in our dataset.

Datasets from heterogeneous sources often use different geographies, such as administrative, census, and postal geographies. A unified, fine-grained unit is essential for merging these data and for uncovering the relationship between environmental factors and health outcomes, as well as for supporting region-level comparisons30. We therefore select middle layer super output areas (MSOAs) as the main geographical unit in our study; an MSOA is a fine-grained census division with a mean population of around 7200. As an illustrative example, we visualize the MSOAs of Birmingham city with valid data records in Fig. 2. For a more aggregated view, we also provide city-level aggregations in our dataset.

To merge data collected in different geographies, we obtain the MSOA-city lookup table31 and the postcode-MSOA lookup table32 from ONS Geography. By filtering and merging these lookup tables according to the city list, we generate a unified geography lookup table, shown in Table 3, which contains 1039 MSOAs. These MSOAs serve as the minimum spatial units for all subsequent data processing, and the lookup table is used in the following generation procedures to merge data from all sources.
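The filter-and-merge step can be pictured as a join between the two lookup tables. The following is a minimal sketch using only the Python standard library; the field names (`postcode`, `msoa`, `city`) and the placeholder codes are assumptions made for illustration and do not reflect the column names of the ONS files.

```python
def build_lookup(msoa_city_rows, postcode_msoa_rows, cities_of_interest):
    """Join postcode->MSOA with MSOA->city, keeping only the selected cities."""
    # Keep only MSOAs that belong to a city of interest.
    msoa_to_city = {
        row["msoa"]: row["city"]
        for row in msoa_city_rows
        if row["city"] in cities_of_interest
    }
    lookup = []
    for row in postcode_msoa_rows:
        city = msoa_to_city.get(row["msoa"])
        if city is not None:  # drop postcodes outside the selected cities
            lookup.append(
                {"postcode": row["postcode"], "msoa": row["msoa"], "city": city}
            )
    return lookup


# Hypothetical rows; real tables would be read from the ONS lookup files.
msoa_city_rows = [
    {"msoa": "MSOA_A", "city": "Birmingham"},
    {"msoa": "MSOA_B", "city": "Birmingham"},
    {"msoa": "MSOA_C", "city": "Elsewhere"},
]
postcode_msoa_rows = [
    {"postcode": "PC1", "msoa": "MSOA_A"},
    {"postcode": "PC2", "msoa": "MSOA_B"},
    {"postcode": "PC3", "msoa": "MSOA_C"},
]
lookup = build_lookup(msoa_city_rows, postcode_msoa_rows, {"Birmingham"})
```

Each resulting row links a postcode to its MSOA and city, so records keyed by any of the three geographies can be aggregated to MSOA or city level.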

Table 3 Example of essential information of geography lookup table for the produced dataset.

Processing of health outcomes data

We formulate the health outcome of citizens in each region from three aspects: life expectancy, physical health, and mental health. For life expectancy, we collect gender-specific life expectancy and healthy life expectancy at MSOA level from ONS33, then filter the regions according to the geography lookup table described in Table 3. For physical health, we consider seven common non-communicable diseases in cities: asthma, cancer, dementia, diabetes, hyperlipidemia, hypertension, and obesity. For mental health, we mainly consider depression, psychosis, and related disorders. To assess the severity of these diseases, we collect fine-grained prescribing data from the National Health Service (NHS) Business Services Authority34, an informative source for estimating the health status of citizens. It contains the drug code, drug quantity, and corresponding expenditure for each practice, such as a general practitioner (GP), an out-of-hours service, or a hospital department. We focus on expenditure records because they allow disease severity to be evaluated on a common scale across different drugs. Given the large volume of data, we use the Open Data Portal Application Programming Interface (API)35 to query the required information. We identify the drug codes corresponding to each physical and mental health condition through the British National Formulary (BNF)36, then generate the corresponding structured query language (SQL) requests through the API to acquire the aggregated actual cost data for these diseases at postcode level.

Since the outbreak of the SARS-CoV-2 virus at the end of 2019, COVID-19 has become the most influential communicable disease in urban spaces, so we also consider it as a representative communicable disease affecting physical health. For the COVID-19 data, we collect the MSOA-level time series from the UK government37, which contains the number of new cases within rolling 7-day periods. During post-processing, we merge these data into MSOA and city level according to the geography lookup table.
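To make the SQL step for the prescribing data concrete, the sketch below composes a query string that sums the actual cost per postcode for a set of BNF code prefixes. It only builds the string and does not contact any service. The table name, the column names (`POSTCODE`, `BNF_CODE`, `ACTUAL_COST`), and the prefix `0403` (the BNF section for antidepressant drugs, used here purely as an illustration) are assumptions, not the exact request used to build the dataset.

```python
def build_cost_query(table, bnf_prefixes, postcode_col="POSTCODE",
                     code_col="BNF_CODE", cost_col="ACTUAL_COST"):
    """Compose a SQL string summing actual cost per postcode for BNF prefixes."""
    if not bnf_prefixes:
        raise ValueError("at least one BNF prefix is required")
    for prefix in bnf_prefixes:
        # Guard against malformed input before interpolating into SQL text.
        if not prefix.isalnum():
            raise ValueError(f"unexpected characters in prefix: {prefix!r}")
    conditions = " OR ".join(f"{code_col} LIKE '{p}%'" for p in bnf_prefixes)
    return (
        f"SELECT {postcode_col}, SUM({cost_col}) AS total_cost "
        f'FROM "{table}" WHERE {conditions} GROUP BY {postcode_col}'
    )


# Hypothetical table name; one monthly prescribing table would go here.
query = build_cost_query("PRESCRIBING_TABLE_PLACEHOLDER", ["0403"])
```

Aggregating on the server side in this way returns one row per postcode instead of one row per prescribed item, which is what makes the large data volume manageable.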

Processing of basic statistics data

The basic statistics data include the population, area, boundary, and centroid of the selected regions, providing essential information for understanding the composition of urban spaces. Specifically, we collect the latest estimates of the usual resident population at MSOA level38, which date from mid-2020. We filter the population numbers of the selected MSOAs38 and aggregate them according to the geography lookup table to obtain the city population. The most recent city boundary was defined in 201529 and corresponds to the 2011 census. We therefore collect the geographical boundaries39,40 and the geographical lookup tables31,32 of MSOAs in their 2011 definition. We adopt the generalized boundary with a 20 m error range in our dataset, which strikes a good balance between accuracy and data size. For the boundary data, we filter the MSOA boundary39 and city boundary29 accordingly, and save the polygons in GeoJSON format with the corresponding MSOA codes and city codes. We preserve the original WGS84 coordinate system in the resulting files. The boundary data also contain the area of each region, which we convert to km². From the population and area data, we calculate the population density of each MSOA and city in our dataset. For the centroid data, we use the Python package shapely to calculate the geometric centroids from the above boundaries of cities and MSOAs.
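As a worked illustration of the centroid and density calculations, the sketch below computes the area and area-weighted centroid of a simple polygon with the shoelace formula, then derives a population density. It is a standard-library stand-in for a geometry package rather than the processing code itself. It assumes planar coordinates in kilometres, so WGS84 longitude/latitude pairs would need projecting first, and the rectangle and population figure are invented.

```python
def polygon_area_and_centroid(ring):
    """Shoelace area and centroid of a simple polygon given as [(x, y), ...]."""
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # close the ring if the caller left it open
    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    signed_area = twice_area / 2.0
    centroid = (cx / (6.0 * signed_area), cy / (6.0 * signed_area))
    return abs(signed_area), centroid


# Hypothetical 4 km x 2 km rectangular region with 7200 residents.
boundary = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
area_km2, centroid = polygon_area_and_centroid(boundary)
population_density = 7200 / area_km2  # residents per km^2
```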

Processing of behaviour environment data

The venues in a city subtly affect the behaviour of its citizens: researchers have found strong evidence that the availability of tobacco and alcohol22, open green spaces9,10,41, and medical resources42 affects health outcomes. Here, we use point-of-interest (POI) data to measure the availability of tobacco, alcohol, physical exercise, and health care services in a neighbourhood as important health-related behaviour factors. Specifically, we collect the SafeGraph Places data43, which contain more than 1.5 million records for the whole UK. We filter the POIs by their categories, which follow the 2017 version of the North American Industry Classification System (NAICS)44. NAICS is a classification system developed by the US Census Bureau that uses a numeric code of up to 6 digits to classify venues hierarchically. For tobacco availability, we keep the POIs in the NAICS categories Tobacco Stores and Grocery Stores. For alcohol availability, we use Drinking Places; Beer, Wine, and Liquor Stores; and Grocery Stores. For physical exercise availability, we consider Fitness and Recreational Sports Centers; and Nature Parks and Other Similar Institutions. For health care services availability, we consider Health and Personal Care Stores; Ambulatory Health Care Services; Hospitals; and Nursing and Residential Care Facilities. Finally, we calculate each availability indicator as the ratio of the corresponding POI count to the region population.
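The availability indicator is a per-capita count of POIs in the relevant categories. The following minimal sketch shows it for tobacco availability; the record fields (`msoa`, `category`), the placeholder MSOA codes, and the population figures are assumptions made for illustration.

```python
from collections import Counter

# Category names as listed in the text for tobacco availability.
TOBACCO_CATEGORIES = {"Tobacco Stores", "Grocery Stores"}


def availability(pois, categories, population_by_msoa):
    """POIs in the given categories per resident, for each MSOA."""
    counts = Counter(p["msoa"] for p in pois if p["category"] in categories)
    return {
        msoa: counts.get(msoa, 0) / population
        for msoa, population in population_by_msoa.items()
        if population > 0  # skip regions without residents
    }


# Hypothetical POI records and populations.
pois = [
    {"msoa": "MSOA_A", "category": "Tobacco Stores"},
    {"msoa": "MSOA_A", "category": "Grocery Stores"},
    {"msoa": "MSOA_A", "category": "Drinking Places"},
    {"msoa": "MSOA_B", "category": "Grocery Stores"},
]
population_by_msoa = {"MSOA_A": 8000, "MSOA_B": 5000}
tobacco_availability = availability(pois, TOBACCO_CATEGORIES, population_by_msoa)
```

Swapping in a different category set gives the alcohol, physical exercise, or health care indicator with the same function.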

Processing of built environment data

The urban built environment, as an important determinant of health, shapes citizens’ physical activity and mental well-being45. In this study, we incorporate house price, building density, road network density, street view features, satellite features, and walkability to jointly describe the built environment of urban spaces.

We collect the median and mean house price data from ONS46,47,48,49, which include seasonal time series of MSOA-level house prices from 1995 to the present for both newly built and existing dwellings. The data cover common house types such as detached houses, semi-detached houses, terraced houses, and flats and maisonettes. Here, we extract the general indicator covering all sales and all house types for the selected regions in our study.

We collect the building information and road networks from OpenStreetMap50. To export large-scale map data, we use the bulk download service provided by Geofabrik51. We manually download the smallest subregion files that contain the cities of interest, and use the Python package pyrosm to extract the building information and road networks of the selected cities and MSOAs by specifying the corresponding boundary polygons. We count the number of buildings in each region and calculate the building density by dividing this count by the region area. For the road network, we extract the driving, cycling, and walking networks separately, and calculate each road density indicator as the ratio of total road length to region area.
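Once buildings and road segments have been extracted for a region, both density indicators reduce to simple ratios. The sketch below assumes each road segment already carries its length in kilometres and the set of network types it belongs to. This record layout and all numbers are hypothetical and do not correspond to the output format of any particular OpenStreetMap parser.

```python
def network_density(segments, network_type, area_km2):
    """Total length (km) of segments in one network type, per km^2 of region."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    total_km = sum(
        s["length_km"] for s in segments if network_type in s["networks"]
    )
    return total_km / area_km2


def building_density(n_buildings, area_km2):
    """Number of buildings per km^2 of region."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return n_buildings / area_km2


# Hypothetical road segments for one region of 2 km^2.
segments = [
    {"length_km": 1.2, "networks": {"driving", "cycling"}},
    {"length_km": 0.8, "networks": {"walking", "cycling"}},
    {"length_km": 0.5, "networks": {"walking"}},
]
area_km2 = 2.0
road_densities = {
    t: network_density(segments, t, area_km2)
    for t in ("driving", "cycling", "walking")
}
buildings_per_km2 = building_density(340, area_km2)
```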

The availability of street view imagery from map platforms such as Google52 offers a new angle from which to observe and analyse how the urban environment relates to the health outcomes of citizens53,54. For the street view image data, we divide the urban spaces into 100 m × 100 m grids and download the 360° images from Google Maps52, which yields 784 thousand images. Recent advances in deep learning make automatic feature extraction from large-scale image data possible. In our study, we adopt the state-of-the-art semantic segmentation model ViT-Adapter55, which is based on the vision transformer, to automatically infer the objects in the street view images and provide high-accuracy pixel-level classification of the input images. Specifically, we apply the official implementation56 provided by the authors, trained on the Cityscapes dataset57, to our street view images. It recognizes 19 different objects in an image, which are listed in Table 4. We calculate the pixel-level percentage of each object, and aggregate the percentages at the MSOA and city level to capture the visual semantics of neighbourhood features.

Table 4 Recognized objects for street view and satellite view images.

The satellite view imagery is obtained from Esri World Imagery58 according to the method described in ref. 59 and its corresponding code implementation60. Specifically, we collect satellite image tiles at 0.6 m resolution covering all the cities of interest. We then train the ViT-Adapter55 model on the LoveDA dataset61 to extract the 7 labeled objects as features from the collected satellite images. As with the street view images, we aggregate the inference results according to the MSOA and city boundaries, and calculate the pixel-level percentage of each annotated object.
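The pixel-level percentage used for both street view and satellite features can be sketched as follows. Each segmentation mask is represented as a 2-D list of class labels, and regional values are taken as the unweighted mean over the images in a region. The unweighted mean, the tiny masks, and the image-to-region mapping are all assumptions for illustration; the text does not specify the aggregation rule.

```python
from collections import Counter, defaultdict


def class_fractions(label_mask):
    """Fraction of pixels per class in one segmentation mask (2-D list)."""
    counts = Counter(label for row in label_mask for label in row)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def aggregate_by_region(image_fractions, image_region):
    """Average per-image class fractions over the images in each region."""
    sums = defaultdict(Counter)
    n_images = Counter()
    for image_id, fractions in image_fractions.items():
        region = image_region[image_id]
        n_images[region] += 1
        for label, fraction in fractions.items():
            sums[region][label] += fraction
    return {
        region: {label: s / n_images[region] for label, s in labels.items()}
        for region, labels in sums.items()
    }


# Hypothetical 2 x 2 masks for two images located in the same MSOA.
masks = {
    "img1": [["road", "road"], ["sky", "building"]],
    "img2": [["road", "sky"], ["sky", "sky"]],
}
image_fractions = {k: class_fractions(m) for k, m in masks.items()}
region_features = aggregate_by_region(
    image_fractions, {"img1": "MSOA_A", "img2": "MSOA_A"}
)
```

A class that is absent from an image contributes zero to the regional mean, so the resulting fractions for a region still sum to one.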

Walkability is a long-standing indicator in the field of urban planning that evaluates the mix of amenities to quantify how walking-friendly a neighbourhood is62. In this study, we focus on the health benefit of walkability following ref. 30, which defines walkability as the average z-score of population density, intersection density, and a daily living score. We calculate the intersection density from the OpenStreetMap walking network described above, using the Python package shapely to determine whether two roads intersect. We count the intersections in each region and divide by the corresponding area to obtain the intersection density. For the daily living score, we consider the density of daily living POIs in each region. Following ref. 30, we define daily living POIs as those in the following categories: Grocery Stores, Nature Parks and Other Similar Institutions, Air Transportation, Rail Transportation, Water Transportation, and Transit and Ground Passenger Transportation. We calculate the daily living score by dividing the total number of these POIs by the region area. We normalize the above three indicators according to the following equation

$$Z_\ast = \frac{x_\ast - \mu_\ast}{\sigma_\ast},$$


where x* may be the population density, intersection density, or daily living score, and μ* and σ* are the mean and standard deviation of x*. Finally, we derive the walkability score by taking the average of the three normalized indicators.
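The normalization and averaging can be written in a few lines. This minimal sketch assumes the population (rather than sample) standard deviation, which the text does not specify, and uses made-up indicator values for three regions.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("indicator is constant across regions")
    return [(v - mu) / sigma for v in values]


def walkability(pop_density, intersection_density, daily_living):
    """Average of the three z-scored indicators, one score per region."""
    columns = zip(
        z_scores(pop_density),
        z_scores(intersection_density),
        z_scores(daily_living),
    )
    return [mean(triple) for triple in columns]


# Hypothetical indicator values for three regions.
scores = walkability(
    pop_density=[1000.0, 2000.0, 3000.0],
    intersection_density=[10.0, 20.0, 30.0],
    daily_living=[1.0, 2.0, 3.0],
)
```

Because every indicator is standardized across the same set of regions, the score is relative: it ranks regions against each other and is not comparable across datasets normalized separately.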

Processing of natural environment data

Exposure to polluted air is considered a major health challenge for citizens63,64,65. The air quality data are obtained from UK Air66, which is run by the Department for Environment, Food & Rural Affairs (DEFRA). We focus on the Automatic Urban and Rural Network (AURN), the UK’s largest automatic monitoring network for common air pollutants. Specifically, we collect the daily mean records of nitrogen oxides as nitrogen dioxide, PM2.5, and PM10 particulate matter as the air pollution indicators in our dataset. The collected data are available at the station level. We manually select the stations and the corresponding pollution data according to the interactive map67 and station information68. For cities with multiple stations, we preserve all the observations in our data.

Climate is closely tied to the well-being of all people69,70,71. Recent evidence shows that a worsening climate is correlated with a variety of health outcomes, including insufficient nutrition, pandemic outbreaks, and increased anxiety and depression72,73. To evaluate how changing weather affects the health outcome in each region, we collect weather data from HadUK-Grid, maintained by the Met Office74, which is a collection of gridded climate variables at high spatial resolution. We collect temperature, precipitation, relative humidity, sunshine duration, snow lying days, and wind speed as the weather features. During post-processing, we align the gridded weather data to the MSOA and city level. Specifically, we use the Python package h5netcdf to read the weather data, which are provided in NetCDF format. We then calculate the distance between each grid point and the geometric centre of each region with the Python package haversine, and assign the nearest grid point to the region. Considering the sizes of MSOAs and cities, we use the 1 km × 1 km resolution data to match each MSOA and the 12 km × 12 km data to match each city.
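The nearest-grid-point matching can be sketched as below. To stay self-contained, the sketch reimplements the haversine great-circle formula with the standard library instead of calling a package. The grid values and the centroid coordinates are invented, and a full pipeline would read the grid from the NetCDF files.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_grid_value(centroid, grid_points):
    """Value of the grid point closest to a region centroid (lat, lon)."""
    lat, lon = centroid
    nearest = min(
        grid_points, key=lambda g: haversine_km(lat, lon, g[0], g[1])
    )
    return nearest[2]


# Hypothetical (lat, lon, temperature) grid points and a region centroid.
grid_points = [
    (52.50, -1.90, 10.1),
    (52.40, -1.80, 9.5),
    (53.00, -2.00, 8.7),
]
region_temperature = nearest_grid_value((52.48, -1.90), grid_points)
```

A linear scan over all grid points is adequate for a sketch; with a full 1 km grid and about a thousand MSOAs, a spatial index or a restriction to nearby cells would reduce the number of distance evaluations.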

