Filecoin: A deep dive into the importance and business potential of distributed data computing

*Editor's note: This article is based on David Aronchick's talk at the 2023 Filecoin Unleashed conference in Paris. David is the CEO of Expanso and the former head of compute-over-data at Protocol Labs, where he launched the Bacalhau project. The article represents the independent views of the original creator, who has granted permission to republish.*

According to IDC, the amount of data stored globally will exceed 175 ZB by 2025. That is a staggering volume: the equivalent of 175 trillion 1 GB USB flash drives. Most of it will have been generated between 2020 and 2025, with data creation growing at an expected CAGR of 61%.

This rapidly growing datasphere presents two major challenges today:

  • **Moving data is slow and expensive.** At current bandwidth, downloading 175 ZB of data would take roughly 1.8 billion years (see the sanity check below).
  • **Compliance is onerous.** There are hundreds of data-related regulations worldwide, making cross-jurisdictional compliance practically impossible.
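
Both headline numbers are easy to sanity-check. The short Python sketch below reproduces them; note that the sustained link speed of about 25 Mbps is our own assumption, chosen because it roughly matches the 1.8-billion-year figure, not a number given in the talk.

```python
# Back-of-the-envelope check of the two headline figures above.

ZB = 10**21                     # bytes in a zettabyte
GB = 10**9                      # bytes in a gigabyte
total_bytes = 175 * ZB

# 175 ZB expressed as 1 GB USB drives: ~1.75e14, i.e. 175 trillion.
print(f"{total_bytes / GB:.2e} one-GB drives")

# Time to download 175 ZB over a single link.
# ASSUMPTION: ~25 Mbps sustained throughput (not from the source).
link_bps = 25 * 10**6           # bits per second
seconds = total_bytes * 8 / link_bps
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.2e} years")     # ~1.8e9, i.e. about 1.8 billion years
```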

The combined result of slow network growth and regulatory constraints is that nearly 68% of organizational data sits idle. This is why it is so important to move compute to where the data is stored (an approach broadly called compute-over-data, or CoD) rather than moving data to where the compute is, and it is exactly what CoD platforms such as Bacalhau are working on.
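
To make the idea concrete, here is a minimal, purely illustrative sketch of the core compute-over-data decision: instead of copying a dataset to a central cluster, the scheduler routes the job to a node that already holds the data. All names are hypothetical; real platforms such as Bacalhau implement much richer versions of this logic.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    region: str
    datasets: set[str]   # IDs of datasets stored locally on this node

@dataclass
class Job:
    image: str           # container image holding the computation
    dataset: str         # ID of the dataset the job needs

def schedule(job: Job, nodes: list[Node]) -> Node:
    """Compute-over-data: send the job to a node that already has the data,
    instead of pulling the data to a central cluster."""
    for node in nodes:
        if job.dataset in node.datasets:
            return node
    raise LookupError(f"no node holds dataset {job.dataset!r}")

nodes = [
    Node("eu-edge-1", "eu-west", {"cctv-paris-2023"}),
    Node("us-dc-1", "us-east", {"sales-2022"}),
]
job = Job(image="ghcr.io/example/video-analytics:latest", dataset="cctv-paris-2023")
print(schedule(job, nodes).name)   # -> eu-edge-1: compute moves, data stays put
```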

The sections below briefly cover:

  • How organizations handle data today.
  • An alternative approach built on compute-over-data.
  • Why distributed computing matters.

The status quo

Currently, there are three main ways in which organizations are addressing data processing challenges, none of which are ideal.

Using a centralized system

The most common approach is to use a centralized system for large-scale data processing. Organizations often combine computing frameworks such as Apache Spark, Hadoop, Databricks, Kubernetes, Kafka, and Ray into a network of clustered systems connected to a centralized API server. However, these systems do not effectively address data breaches or the other regulatory issues surrounding data mobility.

This is partly why organizations have incurred billions of dollars in administrative fines and penalties over data breaches.

Build it yourself

Another approach is for developers to build a custom orchestration system with the awareness and robustness their organization needs. This can work, but such systems often fail because they rely on a small number of people to maintain and run them.

Do nothing

Surprisingly, most of the time institutions simply do nothing with their data. For example, a city may collect vast amounts of surveillance video every day, but because processing it is so expensive, the footage can only be viewed on a local machine and is never archived or analyzed.

Build true distributed computing

There are two main solutions to data processing pain points.

Solution 1: Build on an open source data computing platform

Instead of the custom orchestration systems described above, developers can use an open-source distributed platform for data computing. Because the platform is open source and extensible, organizations only need to build the components they require. This setup handles multi-cloud, multi-compute, and non-data-center scenarios and can navigate complex regulatory environments. Just as important, tapping an open-source community means the organization no longer depends on one or a few developers for system maintenance, which reduces the likelihood of failure.
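
In practice, platforms in this style typically let developers describe a job declaratively: a container image, content-addressed inputs, and a result destination, leaving the network to decide where it runs. The sketch below shows that shape; the `JobSpec` fields, image name, and CID are invented for illustration and are not the actual API of Bacalhau or any specific platform.

```python
# Hypothetical declarative job description for an open-source CoD platform.
# This is not a real SDK; it only illustrates the shape of the interface.

from dataclasses import dataclass

@dataclass
class JobSpec:
    image: str               # container holding the computation
    inputs: list[str]        # content-addressed data (e.g., IPFS CIDs)
    command: list[str]       # what to run inside the container
    publisher: str = "ipfs"  # where results get published

spec = JobSpec(
    image="ghcr.io/example/wordcount:1.0",   # placeholder image
    inputs=["ipfs://<dataset-CID>"],         # placeholder CID
    command=["python", "count.py", "/inputs"],
)

# A real client would serialize this spec and submit it to the network,
# which then picks executor nodes close to the referenced data.
print(spec)
```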

Solution 2: Build on a distributed data protocol

With advanced computing projects such as Bacalhau and Lilypad, developers can go one step further and build systems not only on the open-source data platforms of Solution 1, but on truly distributed data protocols such as the Filecoin network.

This means institutions can use distributed protocols that know how to coordinate and describe user workloads at a much finer granularity, unlocking compute close to where data is generated and stored. Ideally, this transition from the data center to a distributed protocol requires only minor changes to the data scientist's experience.
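
To illustrate how small that change can be, here is a hypothetical before/after sketch: the analysis function is identical in both worlds, and only the submission step differs. The `submit_to_network` helper, image name, and CID are invented for illustration.

```python
from collections import Counter

def top_words(text: str, n: int = 3) -> list[str]:
    """The data scientist's analysis code: identical in both worlds."""
    return [w for w, _ in Counter(text.split()).most_common(n)]

# Before: download the data, then compute locally.
local_text = "a b a c a b"            # stand-in for a downloaded dataset
print(top_words(local_text))          # -> ['a', 'b', 'c']

# After: the same function, packaged in a container and submitted so it
# runs next to the data. `submit_to_network` is a hypothetical helper.
# submit_to_network(image="ghcr.io/example/top-words:1.0",
#                   input="ipfs://<dataset-CID>",
#                   entrypoint=["python", "top_words.py"])
```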

Distribution means maximizing choice

By deploying on a distributed protocol such as the Filecoin network, the vision is that users can reach hundreds (or thousands) of machines spread across regions, all on the same network and all following the same protocol rules. This opens up an ocean of options for data scientists, who can ask the network to (see the sketch after this list):

  • Select a dataset from anywhere in the world.
  • Follow any governance structure, whether it's HIPAA, GDPR or FISMA.
  • Run at the lowest possible price.
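
Expressed as code, such a request might carry those three choices as explicit fields, so dataset location, governance regime, and price ceiling travel with the job. The sketch below is hypothetical; the field names and values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class NetworkRequest:
    dataset: str               # content address, resolvable from anywhere
    governance: str            # compliance regime candidate nodes must meet
    max_price_per_hour: float  # ceiling; the network picks the cheapest match

req = NetworkRequest(
    dataset="ipfs://<genomics-dataset-CID>",  # placeholder CID
    governance="HIPAA",                       # only HIPAA-compliant nodes
    max_price_per_hour=0.25,                  # in some network pricing unit
)
print(req)
```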

Juan's triangle | Acronym key: FHE (fully homomorphic encryption), MPC (multi-party computation), TEE (trusted execution environment), ZKP (zero-knowledge proof)

Speaking of maximizing choice, we have to mention "Juan's triangle," a term coined by Protocol Labs founder Juan Benet to explain why, in the future, different use cases will be served by different distributed computing networks.

Juan's triangle proposes that computing networks must trade off between privacy, verifiability, and performance, so the traditional one-size-fits-all approach rarely fits every use case. Instead, the modular nature of distributed protocols lets different distributed networks (or subnetworks) serve different user needs, whether privacy, verifiability, or performance; each network optimizes for whatever its users value most. In time, many service providers (shown in the boxes inside the triangle) will fill these gaps and make distributed computing a reality.
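
As a toy illustration of that trade-off, the sketch below scores hypothetical subnetworks against a user's priorities. The networks, profiles, and weights are all invented; real network selection would be far more involved.

```python
# Toy model of Juan's triangle: each subnetwork trades off privacy,
# verifiability, and performance; users weight what matters to them.

subnets = {
    "tee-subnet": {"privacy": 0.8, "verifiability": 0.6, "performance": 0.7},
    "zkp-subnet": {"privacy": 0.6, "verifiability": 0.9, "performance": 0.3},
    "raw-subnet": {"privacy": 0.2, "verifiability": 0.2, "performance": 0.9},
}

def pick(weights: dict[str, float]) -> str:
    """Return the subnetwork whose profile best matches the user's weights."""
    return max(subnets, key=lambda s: sum(weights[k] * v
                                          for k, v in subnets[s].items()))

# A medical workload that prizes privacy above all:
print(pick({"privacy": 0.7, "verifiability": 0.2, "performance": 0.1}))
```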

In summary, data processing is a complex problem that calls for solutions outside the usual playbook. Replacing traditional centralized systems with open-source data computing is a good first step. Ultimately, deploying a computing platform on distributed protocols such as the Filecoin network lets computing resources be configured freely around each user's needs, which is crucial in the era of big data and artificial intelligence.
