There is publicly accessible data that describes the socioeconomic characteristics of a geographic location. In Australia, where I live, the Government, through the Australian Bureau of Statistics (ABS), regularly collects and publishes individual and household data on income, occupation, education, employment and housing at the area level. Some examples of published data points include:
- Percentage of people with relatively high/low income
- Percentage of people classified as managers in their respective occupations
- Percentage of people without a formal educational qualification
- Percentage of unemployed people
- Percentage of properties with 4 or more bedrooms
While these data points appear to focus largely on individuals, they reflect people's access to material and social resources and their ability to participate in society in a particular geographic area, which ultimately informs the socioeconomic advantage or disadvantage of that area.
Given these data points, is there a way to derive a score that ranks geographic areas from most to least advantaged?
The objective of obtaining a score can be formulated as a regression problem, where each data point, or feature, is used to predict a target variable, in this case a numerical score. This requires the target variable to be available for at least some observations in order to train the predictive model.
However, since we don't have a target variable to begin with, we may need to approach this problem another way. For example, under the assumption that each geographic area differs from a socioeconomic point of view, can we work out which data points explain the most variation, and thereby derive a score as a numerical combination of those data points?
We can do exactly that using a technique called Principal Component Analysis (PCA), and this article demonstrates how!
ABS publishes data points indicating the socioeconomic characteristics of a geographic area in the “Data Download” section of this web page, in the “Data Cube of Standardised Variable Proportions” (1). These data points are published at the Statistical Area 1 (SA1) level, a digital boundary that divides Australia into areas of approximately 200 to 800 people each. This is a much more granular boundary than postcode or state boundaries.
For demonstration purposes in this article, I will derive a socioeconomic score based on 14 of the 44 published data points provided in Table 1 of the data source above (I will explain why I selected this subset later). These are:
- INC_LOW: Percentage of people living in households with declared equivalent annual income between $1 and $25,999 AUD
- INC_HIGH: Percentage of people with declared annual equivalent income greater than $91,000 AUD
- UNEMPLOYED_IER: Percentage of people aged 15 and over who are unemployed
- HIGH BED: Percentage of occupied private properties with four or more bedrooms
- HIGH MORTGAGE: Percentage of occupied private properties paying a mortgage greater than $2,800 AUD per month
- LOW RENT: Percentage of occupied private properties paying rent less than $250 AUD per week
- PROPERTY: Percentage of private properties occupied without a mortgage
- MORTGAGE: Percentage of private properties occupied with a mortgage
- GROUP: Percentage of occupied private properties that are group-occupied private properties (e.g., apartments or units)
- LONE PERSON: Percentage of occupied private properties occupied by a lone person
- ROOM: Percentage of occupied private properties requiring one or more additional bedrooms (based on the Canadian National Occupancy Standard)
- NOCAR: Percentage of private properties occupied without cars
- SINGLE PARENT: Percentage of families that are single-parent families
- UNINCORP: Percentage of properties with at least one person who owns an unincorporated business
In this section, I will walk through the Python code used to derive a socioeconomic score for each SA1 region in Australia using PCA.
I will start by loading the required Python packages and data.
## Load the required Python packages
### For dataframe operations
import numpy as np
import pandas as pd
### For PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
### For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
### For Validation
from scipy.stats import pearsonr
## Load data
file1 = 'data/standardised_variables_seifa_2021.xlsx'
### Read from Table 1, from row 5 onwards, for columns A to AT
data1 = pd.read_excel(file1, sheet_name = 'Table 1', header = 5,
usecols = 'A:AT')
## Remove rows with missing values (113 out of 60k rows)
data1_dropna = data1.dropna()
An important cleaning step before performing PCA is to standardize each of the 14 data points (features) to a mean of 0 and a standard deviation of 1. This is mainly to ensure that the loadings PCA assigns to each feature (think of them as indicators of how important a feature is) are comparable across features. Otherwise, more weight may be given to a feature that is not actually important, or vice versa.
Note that the ABS data source cited above already provides the features in standardized form. That said, for a non-standardized data source, the standardization step would look like this:
## Standardise data for PCA
### Take all but the first column, which is merely a location indicator
data_final = data1_dropna.iloc[:, 1:]
### Perform standardisation of data
sc = StandardScaler()
sc.fit(data_final)
### Standardised data
data_final = sc.transform(data_final)
With standardized data, PCA can be performed in just a few lines of code:
## Perform PCA
pca = PCA()
pca.fit_transform(data_final)
PCA aims to represent the underlying data by principal components (PCs). The number of PCs returned by a PCA equals the number of standardized features in the data; in this case, 14 PCs are returned.
Each PC is a linear combination of all the standardized features, differentiated only by its feature loadings. For example, the image below shows the loadings assigned to the first and second PCs (PC1 and PC2) for each feature.
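The loadings in that image come straight from the fitted PCA object: pca.components_ holds one row per PC and one column per feature. The snippet below is a minimal sketch of how the df_plot dataframe used for the heatmap later in this article could be assembled (the exact construction is not shown here, so treat the index and column names as assumptions).
## Assemble loadings for PC1 and PC2 (a sketch; df_plot construction assumed)
### pca.components_ has shape (number of PCs, number of features)
feature_names = data1_dropna.columns[1:]   # assumes the first column is the SA1 identifier
df_plot = pd.DataFrame(
    pca.components_[:2].T,                 # one row per feature, one column per PC
    index = feature_names,
    columns = ['PC1', 'PC2']
)
print(df_plot.round(2))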
With 14 PCs, the following code provides a visualization of how much variation each PC explains:
## Create visualization for variations explained by each PC
exp_var_pca = pca.explained_variance_ratio_
plt.bar(range(1, len(exp_var_pca) + 1), exp_var_pca, alpha = 0.7,
label = '% of Variation Explained',color = 'darkseagreen')
plt.ylabel('Explained Variation')
plt.xlabel('Principal Component')
plt.legend(loc = 'best')
plt.show()
As illustrated in the visualization below, Principal Component 1 (PC1) accounts for the largest proportion of variance in the original data set, with each subsequent PC explaining less. To be specific, PC1 explains around 35% of the variation within the data.
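The same quantities can be checked numerically to support the choice of how many PCs to keep. The short snippet below simply prints the individual and cumulative explained-variance ratios from the fitted PCA object above.
## Print individual and cumulative variation explained (a quick check on the chart above)
cum_var = np.cumsum(exp_var_pca)
for i, (v, c) in enumerate(zip(exp_var_pca, cum_var), start = 1):
    print(f'PC{i}: {v:.1%} of variation (cumulative: {c:.1%})')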
For the purposes of the demonstration in this article, PC1 is chosen as the only PC to derive the socioeconomic score, for the following reasons:
- PC1 explains a sufficiently large variation within the data in relative terms.
- While choosing more PCs potentially explains (marginally) more variation, doing so makes the score harder to interpret in the context of the socioeconomic advantages and disadvantages of a particular geographic area. For example, as shown in the image below, PC1 and PC2 can provide conflicting narratives about how a particular feature (e.g. 'INC_LOW') influences the socioeconomic variation of a geographic area.
## Show and compare loadings for PC1 and PC2
### Using df_plot dataframe per Image 1
sns.heatmap(df_plot, annot = False, fmt = ".1f", cmap = 'summer')
plt.show()
To obtain a score for each SA1, we simply multiply the standardized proportion of each feature by its PC1 loading and sum over the features. This can be achieved by:
## Obtain raw score based on PC1
### Perform sum product of standardised features and PC1 loadings
pca_data_transformed = pca.fit_transform(data_final)
### Reverse the sign of the sum product above to make the output more interpretable
pca_data_transformed = -1.0 * pca_data_transformed
### Convert to Pandas dataframe, and join raw score with SA1 column
pca1 = pd.DataFrame(pca_data_transformed[:, 0], columns = ['Score_Raw'])
score_SA1 = pd.concat([data1_dropna['SA1_2021'].reset_index(drop = True), pca1],
                      axis = 1)
### Inspect the raw score
score_SA1.head()
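As a sanity check on the "sum product" description above, the raw score can also be computed directly as the dot product of the standardized features with the PC1 loadings. This is just a sketch (with the same sign flip applied) and should match the Score_Raw column up to floating-point tolerance.
## Sanity check: raw score as an explicit sum product (not part of the original code)
### Dot product of standardised features with PC1 loadings, sign reversed as above
manual_score = -1.0 * (data_final @ pca.components_[0])
print(np.allclose(manual_score, score_SA1['Score_Raw']))  # expected: True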
The higher the score, the more advantaged an SA1 is in terms of access to socioeconomic resources.
How do we know that the score we got above was even remotely correct?
For context, the ABS actually publishes a socioeconomic score called the Index of Economic Resources (IER), defined on the ABS website as:
“The Index of Economic Resources (IER) focuses on the financial aspects of relative socioeconomic advantage and disadvantage, by summarising variables related to income and housing. The IER excludes education and occupation variables as they are not direct measures of economic resources. It also excludes assets such as savings or equity which, although relevant, cannot be included as they are not collected in the Census.”
Without revealing the detailed steps, the ABS states in its Technical Paper that the IER was derived using the same 14 features and the same methodology (PCA, keeping PC1 only) that we applied above. That is, if we derived the scores correctly, they should be comparable to the published IER scores, available here (“Statistical Area Level 1, Indexes, SEIFA 2021.xlsx”, Table 4).
Since the published score is standardized to a mean of 1000 and a standard deviation of 100, we begin validation by standardizing the raw score in the same way:
## Standardise raw scores
score_SA1['IER_recreated'] = (score_SA1['Score_Raw'] / score_SA1['Score_Raw'].std()) * 100 + 1000
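As a quick check that the rescaling behaved as intended (the raw PCA scores are centred on zero), the recreated scores should now have a mean of roughly 1000 and a standard deviation of roughly 100.
## Quick check on the rescaled scores (expect mean ~1000, std ~100)
print(score_SA1['IER_recreated'].mean(), score_SA1['IER_recreated'].std())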
For comparison, we read in the IER scores published by the ABS for each SA1:
## Read in ABS published IER scores
## similarly to how we read in the standardised portion of the features
file2 = 'data/Statistical Area Level 1, Indexes, SEIFA 2021.xlsx'
data2 = pd.read_excel(file2, sheet_name = 'Table 4', header = 5,
usecols = 'A:C')
data2.rename(columns = {'2021 Statistical Area Level 1 (SA1)': 'SA1_2021', 'Score': 'IER_2021'}, inplace = True)
col_select = ['SA1_2021', 'IER_2021']
data2 = data2[col_select]
ABS_IER_dropna = data2.dropna().reset_index(drop = True)
Validation 1: PC1 loadings
As shown in the image below, comparing the PC1 loadings derived above with the PC1 loadings published by the ABS suggests that they differ by a constant factor of -45%. Since this is simply a difference in scale, it does not affect the derived scores, which are standardized (to a mean of 1000 and a standard deviation of 100) in any case.
(You can cross-check the 'Derived (A)' column against the PC1 loadings in Image 1.)
Validation 2: Score distribution
The following code creates a histogram for both scores, the shapes of which appear almost identical.
## Check distribution of scores
score_SA1.hist(column = 'IER_recreated', bins = 100, color = 'darkseagreen')
plt.title('Distribution of recreated IER scores')
ABS_IER_dropna.hist(column = 'IER_2021', bins = 100, color = 'lightskyblue')
plt.title('Distribution of ABS IER scores')
plt.show()
Validation 3: IER score by SA1
As a final validation, let's compare the two IER scores for each SA1:
## Join the two scores by SA1 for comparison
IER_join = pd.merge(ABS_IER_dropna, score_SA1, how = 'left', on = 'SA1_2021')
## Plot scores on x-y axis.
## If scores are identical, it should show a straight line.
plt.scatter('IER_recreated', 'IER_2021', data = IER_join, color = 'darkseagreen')
plt.title('Comparison of recreated and ABS IER scores')
plt.xlabel('Recreated IER score')
plt.ylabel('ABS IER score')
plt.show()
A diagonal straight line, as shown in the output image below, supports that the two scores are largely identical.
In addition to this, the following code shows that the two scores have a correlation close to 1:
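(The original snippet is not reproduced here; below is a minimal sketch using the pearsonr import from the start of the article and the IER_join dataframe defined above, dropping any SA1s without a recreated score.)
## Check correlation between the recreated and ABS IER scores
valid = IER_join.dropna(subset = ['IER_recreated', 'IER_2021'])
corr, _ = pearsonr(valid['IER_recreated'], valid['IER_2021'])
print(f'Pearson correlation: {corr:.4f}')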
The demonstration in this article effectively replicates how the ABS calibrates the IER, one of four socioeconomic indices it publishes, which can be used to classify the socioeconomic status of a geographic area.
Taking a step back, what we have essentially achieved is a reduction in the dimensionality of the data from 14 to 1, at the cost of losing some of the information conveyed by the data.
Dimensionality reduction techniques such as PCA are also commonly used to reduce high-dimensional spaces, such as text embeddings, to two or three principal components that can be plotted.
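To illustrate that last point, here is a small, self-contained sketch that projects high-dimensional vectors down to two plottable components. The random vectors stand in for text embeddings purely for illustration; they are an assumption, not data from this article.
## Illustrative only: reduce high-dimensional vectors (e.g. text embeddings) to 2 components
rng = np.random.default_rng(0)
embeddings = rng.normal(size = (500, 384))      # 500 stand-in "embeddings" of dimension 384
pca_2d = PCA(n_components = 2)
embeddings_2d = pca_2d.fit_transform(embeddings)
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s = 5, color = 'darkseagreen')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('High-dimensional vectors projected onto 2 principal components')
plt.show()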