The Rise of the Data Network
Some of you reading this will recall the day when we first learned that connecting things together, and making them easier to get to, was preferable to moving or copying things about to put them in a convenient place. Examples include client/server (which weaned us from the mainframe as the dominant way that business computing occurred), networked PCs (which are now so taken for granted that just about every device we imagine wants to be networked, including my car), hyperlinks in web pages (i.e., don’t copy, link), mobile devices, and so on.
Just as we networked desktops, web pages, mobile devices, and applications, we must now seriously contemplate networking data. I will assert that networking lots of data together is what will make it Big Data, not analytics platforms like Hadoop, and not large databases. And this is a really big idea.
“Manage in Place” is why we’ll do it; “Data Networks” is how we’ll do it.
So, if the goal is to create data networks, what would such a data network infrastructure look like? While this is all still fairly new, I think we can safely make a few predictions about what a big data network would look like.
First, working from the ground up, there should be a standard way to access any data source that falls under the heading of Big Data. Think of it as the Ethernet, and the Ethernet connector, for Big Data. This doesn’t exist yet, but needs to, so that anyone who builds a data repository of any kind (whether structured or unstructured) can link it into the big data network (and access it remotely) in a consistent, open, but secure way.
Until such a standard exists, vendors of data network components will have to build custom connectors to each data source that needs to be managed. From an industry perspective this is rather inefficient (since vendors are going to end up writing the same interface code over and over again), but so were proprietary networks until the network industry evolved. Let’s hope that, as an industry, we don’t have to live with this proprietary technology for too long.
The good news is that most widely used data sources (SharePoint, NFS, etc.) support networked protocols that enable data networks to be built without relying on agent technology; let’s hope we’ve learned our lesson and can at least avoid having agents in the mix. Yeah, me, too.
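To make the connector idea a little more concrete, here is a minimal Python sketch of what a common access interface might look like. No such standard exists today, so everything in it is an assumption made for illustration: the DataConnector class, its list_items and read methods, and the in-memory stand-in for a real source are all hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class DataConnector(ABC):
    """Hypothetical uniform interface that every data source would expose."""

    @abstractmethod
    def list_items(self) -> Iterator[str]:
        """Yield the identifier of every data element in the source."""

    @abstractmethod
    def read(self, item_id: str) -> bytes:
        """Return the raw bytes of one data element."""


class InMemoryConnector(DataConnector):
    """Stand-in for a real source such as a file share; holds data in a dict."""

    def __init__(self, items: Dict[str, bytes]) -> None:
        self._items = dict(items)

    def list_items(self) -> Iterator[str]:
        return iter(sorted(self._items))

    def read(self, item_id: str) -> bytes:
        return self._items[item_id]


def total_bytes(connector: DataConnector) -> int:
    """Written once against the interface, so it works with any connector."""
    return sum(len(connector.read(item_id)) for item_id in connector.list_items())


share = InMemoryConnector({"report.txt": b"quarterly numbers", "notes.txt": b"todo"})
print(list(share.list_items()))
print(total_bytes(share))
```

The point of the sketch is the total_bytes function: it never needs to know which kind of source it is talking to, which is exactly the duplication a standard connector would remove.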
The second component we’ll need is something to link the individual data sources together, doing for big data networks what hubs, switches, and routers do for physical networks. We don’t have good terminology yet for what the various components will do, but we can make some intelligent guesses based on lessons learned in the networking world. To make this simple, let’s just call it a Leaf Data Network Element (LDNE).
The LDNE will provide the lowest level connection to the individual data sources themselves, the data connectors. This is also where authentication (if required) will reside, providing the appropriate access controls to interface to the data sources.
Another component that must reside within the LDNE is something that tells the rest of the world about the data this device connects to. This might be considered controversial to some, but I believe this will need to be an index of some sort. It might be as simple as a high-level metadata index about the volumes themselves connected to the LDNE, it might be a metadata index about all the data elements within the various volumes connected to the LDNE, or it might be a full-text index about all the data elements within the various volumes connected to the LDNE.
(The location of the index is controversial because of the fundamental trade-off between scale, meaning how big a data network one can build, and richness, meaning how much information about my data I can know in one place. I’ll assert that we’ll ultimately want scale, which means moving the index off the data source but keeping it as close to the source as possible, and will address richness using other mechanisms. This will also let us begin to envision what “routing” might mean in a data network, but that’s a whole other article.)
A third component that must reside within the LDNE is something that allows external, higher-level entities (Big Data Applications?) to perform operations on the data on the volumes connected to the LDNE (move, copy, delete, analyze, etc.).
The final component I believe must reside on the LDNE is data access control. One point that’s often glossed over in many of the recent discussions on Big Data, especially unstructured data, is that this data is often owned by someone, either an administrator or an end user, and that owner will want to have some say in who can do what to the data they own.
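The four pieces I have described for the LDNE (connections to the data sources, an index, operations, and access control) can be sketched together in a few lines of Python. This is a toy under stated assumptions, not a design: the class name, the method names, the "volume/item" path convention, and the rule that an owner may grant individual operations to other users are all hypothetical, and the index is the simplest metadata-only variant.

```python
from typing import Dict, Set, Tuple


class LeafDataNetworkElement:
    """Toy LDNE: data sources, a metadata index, operations, access control."""

    def __init__(self, volumes: Dict[str, Dict[str, dict]]) -> None:
        # Component 1: the connected data sources. Each volume maps an item
        # name to {"owner": ..., "content": ...}; real connectors would go here.
        self.volumes = volumes
        # Component 4: per-item grants, keyed by (path, user) -> operations.
        self.grants: Dict[Tuple[str, str], Set[str]] = {}
        # Component 2: a metadata index that describes what this LDNE holds.
        self.index: Dict[str, dict] = {}
        self.reindex()

    def reindex(self) -> None:
        """Rebuild the metadata index from the connected volumes."""
        self.index = {
            f"{vol}/{name}": {"owner": item["owner"], "size": len(item["content"])}
            for vol, items in self.volumes.items()
            for name, item in items.items()
        }

    def grant(self, owner: str, path: str, user: str, operation: str) -> None:
        """Let the owner of an item allow one operation for another user."""
        if self.index[path]["owner"] != owner:
            raise PermissionError(f"{owner} does not own {path}")
        self.grants.setdefault((path, user), set()).add(operation)

    def _check(self, user: str, path: str, operation: str) -> None:
        meta = self.index[path]  # raises KeyError for an unknown path
        allowed = self.grants.get((path, user), set())
        if user != meta["owner"] and operation not in allowed:
            raise PermissionError(f"{user} may not {operation} {path}")

    # Component 3: operations performed on behalf of higher-level applications.
    def read(self, user: str, path: str) -> bytes:
        self._check(user, path, "read")
        vol, name = path.split("/", 1)
        return self.volumes[vol][name]["content"]

    def delete(self, user: str, path: str) -> None:
        self._check(user, path, "delete")
        vol, name = path.split("/", 1)
        del self.volumes[vol][name]
        del self.index[path]


ldne = LeafDataNetworkElement(
    {"hr": {"salaries.csv": {"owner": "alice", "content": b"name,salary\n"}}}
)
ldne.grant("alice", "hr/salaries.csv", "bob", "read")
print(ldne.read("bob", "hr/salaries.csv"))
try:
    ldne.delete("bob", "hr/salaries.csv")
except PermissionError as exc:
    print("refused:", exc)
```

In this sketch bob can read the file because alice said so, but his delete is refused, which is the kind of owner-level say over "who can do what" that I think gets glossed over.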
Once we have all of our data sources indexed, we can switch our focus to the highest-level entity in the data network, the data network application layer (DNAL). This is where any big data application will reside, so it can look pretty much like any other application layer (user interface, application logic, persistent storage, etc.), but with one exception: it will have services that allow it to access any number of LDNEs. And, like other types of application environments, DNALs can logically live anywhere (on PCs, mobile devices, even in devices that don’t fit the traditional definition of an application platform, such as a car, home appliance, etc.).
Finally, because Big Data is big, we’ll need the data network to be able to scale to millions of volumes and exabytes of data. As we have learned in other networked disciplines that operate at scale (DNS, TCP/IP, SS7, etc.), we’ll need to federate the data network, which will mean intermediate data network elements that will, well, inter-mediate. So we will need an IDNE. By enabling an arbitrary number of DNALs to talk to an arbitrary number of LDNEs, the IDNEs will become the backbone of the big data network. They might also “route” big data requests and replies between DNALs and LDNEs, so that we don’t have to manage the hosts.txt file of the Big Data network.
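Here is one way the “routing” and federation might work, as a short Python sketch. It borrows longest-prefix matching from IP routing purely as an analogy; nothing in this article specifies how an IDNE would actually route. The class name, the register and route methods, the path-prefix naming scheme, and the make_ldne helper are all hypothetical.

```python
from typing import Callable, Dict

# A "next hop" is anything that can take a data path and answer a request:
# an LDNE, or another IDNE further down the hierarchy.
NextHop = Callable[[str], str]


class IntermediateDataNetworkElement:
    """Toy IDNE that forwards requests by longest matching path prefix."""

    def __init__(self) -> None:
        self.routes: Dict[str, NextHop] = {}

    def register(self, prefix: str, next_hop: NextHop) -> None:
        """Announce that paths starting with `prefix` are reachable via next_hop."""
        self.routes[prefix] = next_hop

    def route(self, path: str) -> str:
        """Forward the request to the most specific registered next hop."""
        matches = [prefix for prefix in self.routes if path.startswith(prefix)]
        if not matches:
            raise LookupError(f"no route to {path}")
        return self.routes[max(matches, key=len)](path)


def make_ldne(name: str) -> NextHop:
    """Stand-in for a real LDNE: just reports which element served the path."""
    return lambda path: f"{name} served {path}"


# A regional IDNE knows its own LDNEs...
emea = IntermediateDataNetworkElement()
emea.register("emea/london/", make_ldne("ldne-london"))
emea.register("emea/paris/", make_ldne("ldne-paris"))

# ...and a root IDNE only needs to know the region, not every leaf.
root = IntermediateDataNetworkElement()
root.register("emea/", emea.route)
root.register("apac/", make_ldne("ldne-tokyo"))

print(root.route("emea/paris/contracts/2012.pdf"))
```

Because an IDNE’s own route method can be registered as a next hop, the application only ever talks to the root, and no single element has to hold the complete list of leaves.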
In summary, by leveraging many of the ideas and components developed within other network-related disciplines, we arrive at the concept of the big data network. Further, when it is designed and configured properly, we can envision a big data network that scales, is open and secure, and is rich in both data and results. After all, it worked in other disciplines; it will work here. And it’s not only the best way to avoid all the crud; it’s the only way.