The integration of sustainability concerns into financial products and financial regulation depends on data. Without underlying data, neither “Net-Zero transition” Exchange Traded Funds nor central bank stress tests of climate risks could exist.
However, sustainability data is not simply “out there” and readily available. Instead, it is shaped, among other things, by historical legacies and biases, market dynamics, and regulation. This project investigates conceptually and empirically how these and other factors constitute a Political Economy of Sustainable Finance Data. This political economy conditions which sustainability aspects financial institutions can assess and govern, and which ones remain unseen.
Contributions to this project deal, among other things, with:
- Mergers and acquisitions in the sustainable finance data market, showing geographical imbalances and concentration effects
- The lasting impact of regulatory choices and early market practices in establishing “corporate-centric” sustainability data as the norm, at the expense of geographic data aggregations
- The assessment of regulatory debates about the scope and format of sustainability reporting from a data quality and data utility perspective
- The mapping of the data sources and data transformations that occur at various stages of sustainable finance data production, as well as of the organizations involved in these steps
- A critical appraisal of the indicators, thresholds, aggregation procedures, and other data transformation methods employed to evaluate sustainability risks and impacts
- The creation of gold-standard benchmark data for sustainability reporting, as well as of alternative experimental datasets and indicators that are spatially embedded
- The automated extraction of georeferenced asset-level information from corporate reporting through Natural Language Processing
The automated extraction of georeferenced asset-level information from corporate reporting through Natural Language Processing
Even though there is no mandatory reporting of georeferenced sustainability information at scale, companies may still disclose data concerning a particular location in other contexts (e.g. materiality and risk assessments, acquisitions, divestments, capital expenditure, new technologies). These disclosures are, however, “hidden” in unstructured texts, tables, and graphs. To get a sense of the contexts in which, and how often, companies discuss their geographically specific assets in their reporting, we apply a Named Entity Recognition pipeline with a dependency parser to corporate reports in PDF format. The pipeline extracts sentences that combine mentions of Geopolitical Entities (cities, regions, nation-states) with mentions of customizable asset keywords (e.g. plant, factory, office), as sketched below.
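The core of this extraction step can be illustrated in a few lines. The snippet below is a minimal sketch, assuming spaCy (whose en_core_web_sm model bundles the NER component and the dependency parser that drives sentence segmentation) and pypdf for PDF text extraction; the asset keyword set and the file name are placeholders, not the project’s actual configuration.

```python
import spacy
from pypdf import PdfReader

# Customizable asset keywords; extend as needed.
ASSET_KEYWORDS = {"plant", "factory", "office", "mine", "refinery"}

# en_core_web_sm includes NER and the dependency parser,
# which also provides sentence boundaries.
nlp = spacy.load("en_core_web_sm")

def extract_asset_sentences(pdf_path: str) -> list[dict]:
    """Return sentences combining a Geopolitical Entity with an asset keyword."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    hits = []
    # For very long reports, process page by page to respect nlp.max_length.
    for sent in nlp(text).sents:
        gpes = [ent.text for ent in sent.ents if ent.label_ == "GPE"]
        assets = sorted({tok.lemma_.lower() for tok in sent
                         if tok.lemma_.lower() in ASSET_KEYWORDS})
        if gpes and assets:
            hits.append({"sentence": sent.text.strip(),
                         "gpes": gpes, "assets": assets})
    return hits

# Hypothetical usage:
# mentions = extract_asset_sentences("acme_annual_report_2023.pdf")
```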
Further analysis of these extracted asset mentions enables research on an array of questions, including whether company and report attributes influence the frequency of geographic disclosures, which geographies are most prevalent for a given sample of companies, and which topics are discussed frequently in the context of individual assets. Moreover, by linking the disclosed geoinformation with other spatial datasets, one can investigate how complete corporate disclosures are and to what extent they feature omissions and biases. One can also assess the geographical alignment of corporate disclosures on sustainability-related topics with academic sources such as the planetary materiality framework.
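As a hypothetical illustration of such downstream analyses, the sketch below assumes the extracted mentions have been flattened into a pandas DataFrame with one row per asset mention; the column names and example rows are invented for demonstration.

```python
import pandas as pd

# Invented example rows; in practice these come from the extraction pipeline.
mentions = pd.DataFrame([
    {"company": "Acme Corp", "sector": "Utilities", "year": 2023,
     "gpe": "Chile", "asset": "plant"},
    {"company": "Acme Corp", "sector": "Utilities", "year": 2023,
     "gpe": "Texas", "asset": "factory"},
    {"company": "Beta AG", "sector": "Chemicals", "year": 2023,
     "gpe": "Germany", "asset": "plant"},
])

# Do report attributes (here: sector) influence disclosure frequency?
mentions_per_report = (mentions.groupby(["company", "year", "sector"])
                               .size().reset_index(name="n_mentions"))
freq_by_sector = mentions_per_report.groupby("sector")["n_mentions"].mean()

# Which geographies are most prevalent in the sample?
top_geographies = mentions["gpe"].value_counts().head(20)
```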
Information extraction from and data quality issues in corporate sustainability reporting
In the absence of comprehensive registers of companies’ environmental impacts, compiling datasets from unstructured corporate sustainability reports has emerged as a second-best solution. This information extraction task can be completed faster and at greater scale by deploying Natural Language Processing (NLP) techniques, including the integration of Large Language Models (LLMs) into information retrieval systems.
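A minimal sketch of such an LLM-based extraction step is shown below, using the openai Python client; the model name, prompt wording, and the focus on Scope 1 emissions are assumptions for illustration, not the project’s actual setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(excerpt: str) -> str:
    # Target field and output schema are illustrative assumptions.
    return (
        "From the report excerpt below, extract the company's Scope 1 "
        "greenhouse gas emissions in tonnes CO2e. Respond as JSON with keys "
        "'scope1_tco2e' (number or null) and 'source_sentence' "
        "(string or null).\n\nExcerpt:\n" + excerpt
    )

def extract_scope1(excerpt: str) -> dict:
    """Ask the model for a structured value plus the sentence it came from."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": build_prompt(excerpt)}],
        response_format={"type": "json_object"},  # constrain output to JSON
    )
    return json.loads(response.choices[0].message.content)
```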
When deploying such systems to generate datasets, one must, however, take two dimensions of data quality into account. Firstly, one has to check how well the information extraction system finds relevant data in the documents while excluding irrelevant values. We study this dimension in the context of corporate Greenhouse Gas reporting by comparing LLM-extracted values against a benchmark dataset created by human experts in several annotation rounds. Secondly, one has to check how accurately the values disclosed in the reports measure the respective environmental aspect. This step becomes necessary because the experience and effort that companies invest in disclosing data such as Greenhouse Gas emissions vary widely. To gauge the quality of a reported value, we propose a typology of 30 interrelated errors and issues in emissions reporting (see figure) that can be inferred from the contextual information in the reports.
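For the first dimension, the comparison against the benchmark can be scored as sketched below; the report-level data layout, the 1% relative tolerance, and the convention of counting a wrong value against both precision and recall are assumptions for illustration.

```python
def score_extractions(gold: dict[str, float | None],
                      predicted: dict[str, float | None],
                      rel_tol: float = 0.01) -> dict[str, float]:
    """Precision/recall over report IDs; a prediction counts as correct
    when it lies within rel_tol of the human-annotated benchmark value."""
    tp = fp = fn = 0
    for report_id, gold_value in gold.items():
        pred_value = predicted.get(report_id)
        if gold_value is None:          # benchmark: no value disclosed
            if pred_value is not None:
                fp += 1                 # system "found" a value the report lacks
        elif pred_value is None:
            fn += 1                     # system missed a disclosed value
        elif abs(pred_value - gold_value) <= rel_tol * abs(gold_value):
            tp += 1
        else:
            fp += 1                     # a wrong value hurts precision...
            fn += 1                     # ...and counts as a miss
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical usage:
# metrics = score_extractions(gold_scope1, llm_scope1)
```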