Considering the carbon impact of Big Data

  • Blog post
  • October 20, 2022

Gregor Kronenberger, Dr. Matthias Schlemmer, and Ruth Melches

Big Data and Artificial Intelligence can support us in one of the major quests of the 21st century: solving the problem of climate change. However, it is becoming increasingly obvious that data collection and storage can itself be part of the problem. The energy consumption of data centers in the European Union could reach a staggering 3.21% of total electricity demand by 2030. With rising energy costs and increasing transparency requirements under ESG regulations, this topic is becoming relevant not just at the macro- but also at the micro-economic level, and executives need to rethink their data management.

Why we need to talk about decarbonization in the Big Data context

How many emails have you and your team written today? A fairly simple question, but your answer reveals a lot, especially when aggregated and analyzed at the departmental, company, or even country level. One piece of information that could be derived from the answers is the grams of CO2e you and your team produce in a day just by sending emails. Studies have found that one standard email equates to around 5-10 grams of CO2e; attach large files, and this number doubles or triples instantly. Gathering, sending, and especially storing data creates emissions and ultimately contributes to harming our environment. This is also true for cloud computing, where most data is hosted today and which is still experiencing immense growth. There are already some early initiatives to make cloud computing “green” by focusing on the environmental impact of data centers. Nevertheless, several hurdles remain on the way to green data, a goal that cannot be achieved at the data center level alone.
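The email figures above lend themselves to a quick back-of-envelope calculation. The sketch below uses only the ranges cited in the text (5-10 g CO2e per standard email, a factor of 2-3 for large attachments); the team size and email counts are hypothetical.

```python
# Back-of-envelope estimate of a team's daily email emissions.
# The per-email figures are the rough ranges cited in the text
# (5-10 g CO2e per standard email, 2-3x with large attachments);
# they are illustrative, not measured values.

CO2E_PER_EMAIL_G = (5, 10)      # grams CO2e per plain email (low, high)
ATTACHMENT_MULTIPLIER = (2, 3)  # factor for emails with large attachments

def daily_email_co2e_g(plain_emails: int, attachment_emails: int):
    """Return a (low, high) range of grams CO2e per day."""
    low = (plain_emails * CO2E_PER_EMAIL_G[0]
           + attachment_emails * CO2E_PER_EMAIL_G[0] * ATTACHMENT_MULTIPLIER[0])
    high = (plain_emails * CO2E_PER_EMAIL_G[1]
            + attachment_emails * CO2E_PER_EMAIL_G[1] * ATTACHMENT_MULTIPLIER[1])
    return low, high

# A hypothetical 20-person team, each sending 30 plain emails
# and 5 emails with large attachments per day:
low, high = daily_email_co2e_g(plain_emails=20 * 30, attachment_emails=20 * 5)
print(f"{low / 1000:.1f}-{high / 1000:.1f} kg CO2e per day")  # 4.0-9.0 kg
```

Even under these modest assumptions, a single team lands in the kilogram range per day, which is exactly why the aggregated, company-wide view matters.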

What is the root cause of the immense need for data collection and storage?

In the past, companies, especially in big tech, have gathered as much information as they could find. For simplicity, let’s use the image of a giant hoover gathering all the information available, even though you are only after that one needle that has been hiding in your carpet for a long time and keeps stabbing you in the foot. Of course, you never know whether there is a second needle that might stab you in the future (in a business context, such a needle might be an issue that needs to be solved or a piece of information required for a different, important analysis). Thanks to ever lower storage and computation costs, indiscriminate data gathering and storage made sense in the past; however, the situation is changing.

The “hoover” data gathering strategy has two main negative side effects, which need to be accounted for:

SNR (Signal-to-Noise Ratio) – Data analysts and companies can no longer detect the relevant information, i.e., the signal, within the sheer amount of data they are provided with. Companies become more concerned with deciding what to do with the data, and with complying with legal storage and retention obligations, than with actually working with it. The high volumes of data that need to be processed consume enormous amounts of time from highly trained professionals, who in the current labor market are not only difficult to find but also expensive to retain. An obvious remedy would be software-based tooling. A vast number of Big Data tools is already available on the market, but using one of them always carries the risk of high costs without a comparable impact on daily business. Even well-known and successful data tools need time and the right input variables to filter appropriate results from the huge data pool. This process also requires energy, which leads to CO2e emissions and ultimately costs the company money. In addition, there is the risk that the tools only surface correlations, not causation. The lack of causality in non-hypothesis-tested data could, in the worst case, even lead to wrong decisions that harm the company in the long run.

An example of misinterpreting data in this way could be the layout of retail shops. Imagine that an analysis of collected data finds that customers in shops with a larger footprint spend more time in the shop and make more impulse purchases, so a higher volume of purchases is achieved per customer and per square meter of retail space. One could now conclude, regardless of other factors, that stores with a larger footprint are more attractive from a sales point of view. Without a controlled experiment or hypothesis-based testing, the decision could be made to build all new stores on a larger footprint, even in the city. However, it is difficult to tell whether this relationship was causal. It could also be that people in rural areas spend more time in the store because they buy more and travel longer distances for their weekly shopping; in addition, they might be more inclined to make impulse purchases due to other demographic factors. If you take the aggregated data and apply this logic to city shops, the conclusion does not necessarily hold true.

Greenhouse gas emissions – Collecting and storing large amounts of data requires a lot of energy, which adds to a company’s ecological footprint. We are observing a very positive trend, in that more and more companies are proactively addressing climate change. This also includes large providers of cloud computing services (e.g., Google Cloud, Microsoft Azure, AWS, and even Tencent), which actively market their clouds as “green”. Marketing activities include publishing KPIs such as PUE (Power Usage Effectiveness) and the use of carbon offsets (especially in the US). Changing the way companies gather, use, and store data can contribute significantly to curbing CO2e emissions and reaching carbon targets. Gathering a huge amount of data that is ultimately not even needed costs the company money (e.g., electricity costs) and does not create value.
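The PUE metric mentioned above is simple enough to state directly: total facility energy divided by the energy consumed by IT equipment alone. A value of 1.0 would mean every kilowatt-hour goes into computing; everything above it is overhead such as cooling and power distribution. The energy figures below are illustrative.

```python
# Power Usage Effectiveness (PUE), the standard data-center
# efficiency KPI: total facility energy / IT equipment energy.
# 1.0 is the theoretical optimum; the values below are illustrative.

def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Ratio of total facility energy to IT equipment energy."""
    return total_facility_kwh / it_equipment_kwh

# A facility drawing 1,200 kWh while its servers consume 1,000 kWh:
print(pue(1200.0, 1000.0))  # 1.2 -> 200 kWh of cooling/distribution overhead
```

When comparing providers’ “green” claims, it is worth noting that PUE only captures facility overhead, not the carbon intensity of the electricity itself.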

What can we do about it?

The question to be addressed is how to decide which pieces of information should be gathered and stored. The simple solution would be to collect only the data for which you have use cases at a given point in time. But this works only to a limited extent, because use cases change over time: two years down the line, missing data that could have been gathered earlier might increase project costs and duration significantly. In our opinion, three actions help companies improve their data management, especially with a focus on decarbonization.

  • Plan ahead

    Companies should think about tomorrow’s use cases now. That is easier said than done, of course, but bear in mind that in our ever-changing world, very long data histories are only useful in very specific cases anyway. Planning in advance which data might be needed, and which topics are very unlikely to be, can already be done today. Beware, however, of the trap of collecting mindlessly: a focus on the most relevant information must be ensured.

  • Divide into hot and cold data

    The decision to store “cold/dormant” data has to be made consciously and is aimed especially at information that does not need to be directly and permanently accessible. Obviously, this consideration should already be part of the use case planning mentioned above. Data needed for projects with longer planning horizons and without tight deadlines can be stored cold in this way. Current offers from storage service providers show that cold storage costs on average half as much as hot storage, largely due to its lower energy consumption.

  • Implement agile data management

    Of course, planning once and classifying data into categories is not enough. Iterative, recurring processes to identify relevant data and determine its granularity are the path to success. Following the motto that a tree sometimes needs pruning so that it can grow again, agile data management should be implemented within companies to guarantee regular reevaluation and energy efficiency.
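The hot/cold split described above can be illustrated with a simple cost model. The only assumption taken from the text is that cold storage costs roughly half as much as hot storage; the price per terabyte and the share of rarely accessed data are hypothetical placeholders.

```python
# Illustrative annual-cost comparison for moving rarely accessed
# data from hot to cold storage. The hot-storage price is a
# hypothetical placeholder; cold storage is assumed (per the text)
# to cost roughly half as much.

HOT_PRICE_PER_TB_YEAR = 250.0                       # hypothetical EUR/TB/year
COLD_PRICE_PER_TB_YEAR = HOT_PRICE_PER_TB_YEAR / 2  # ~half of hot

def annual_storage_cost(total_tb: float, cold_fraction: float) -> float:
    """Annual cost with `cold_fraction` of the data in cold storage."""
    hot_tb = total_tb * (1 - cold_fraction)
    cold_tb = total_tb * cold_fraction
    return hot_tb * HOT_PRICE_PER_TB_YEAR + cold_tb * COLD_PRICE_PER_TB_YEAR

all_hot = annual_storage_cost(100, cold_fraction=0.0)
split = annual_storage_cost(100, cold_fraction=0.6)  # 60% rarely accessed
print(f"Savings: {1 - split / all_hot:.0%}")
```

In this scenario, shifting 60% of the estate to cold storage cuts the storage bill by almost a third, before even counting the avoided emissions.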

Not following the trend of focused data collection, analysis, and storage will ultimately lead to high inefficiencies, useless analyses, expensive data storage, and a negative impact on the environment. With our proven methodology, your company can save 10-20% of its data holding costs and significantly lower its carbon footprint.

Michael Zeitelberger also contributed to this article.

Contact us

Gregor Kronenberger

Partner, Strategy& Germany

Dr. Matthias Schlemmer

Partner, Strategy& Austria

Ruth Melches

Director, Strategy& Germany