Perspective: How to deal with the challenge of massive data?

New Internet technologies such as the Internet of Things and social networks bring convenience to people, but they also generate enormous amounts of structured and unstructured data. How to extract useful information from this massive data through data mining, deliver a good user experience, and strengthen competitiveness is a challenge every enterprise faces.

Baidu's William Zhang disclosed the company's data volumes: "We store hundreds of petabytes, and dozens of terabytes of data need to be processed every day. Some data must be processed within seconds, some within minutes, and some can wait a few hours, but dozens of petabytes are very difficult to finish processing within a few hours. All of our strategies are built around the timeliness of real-time data processing. Internet users' needs are increasingly real-time, for example Weibo, group buying, and flash sales, so real-time processing is very important."

Yahoo!'s Zhou Yuping said, "Yahoo's cloud computing platform has more than 30,000 machines in total, the largest cluster has around 4,000, and total storage capacity exceeds 100 PB." He added that Yahoo has recently put a lot of effort into protecting user privacy and data security. The European Union has stipulated that Yahoo may not store users' data for more than one hundred days. Although the raw data cannot be kept, Yahoo mines it in depth before it expires, extracting the genuinely valuable information from the data and preserving that instead.

SAP's Du Tao likewise described the data volumes SAP helps its customers handle. "As an enterprise-level application provider, SAP pays more attention to the data volumes of its customers. These customers range from small and medium-sized enterprises to large enterprises, including data-intensive industries such as telecommunications, finance, government, and retail, with data volumes from a few terabytes to hundreds of terabytes." At the same time, SAP runs a large data center of its own, mainly serving SAP customers, with 30,000 servers and roughly 15 PB of data, and more and more customers' data will move into SAP's data center.

How to store and use these huge amounts of data?

How is this huge amount of data stored, analyzed, and processed, and how can mining it create greater business value for the company? Several experts introduced the technical architectures they rely on for mining massive data.

SAP's Du Tao described SAP's approach to massive data processing in two parts. "On the one hand, in SAP's own data center, we use standard cloud computing with virtualization and distributed storage. On the other hand, for individual enterprises, there is the in-memory computing technology SAP launched in Germany on April 16. For reading and analyzing massive data, disk-based I/O under the traditional architecture is measured in milliseconds, while processing in RAM is at the nanosecond level. So SAP compresses the customer's data and loads it into memory, reads and analyzes it there, and moves the analysis at the application layer into memory as well, improving performance and helping users make the most of their data."
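To make the performance argument concrete, below is a minimal sketch of the compressed, in-memory columnar idea described above: a column is dictionary-encoded into small integer codes and then aggregated entirely in RAM. The table contents and encoding scheme are hypothetical illustrations, not SAP's actual implementation.

```python
# A minimal sketch of compressed in-memory columnar aggregation.
# Illustrative only; data and encoding are hypothetical, not SAP's product.

from collections import defaultdict

def dictionary_encode(values):
    """Compress a column by replacing each value with a small integer code."""
    codes, dictionary = [], {}
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    # Reverse mapping: code -> original value
    decode = {code: value for value, code in dictionary.items()}
    return codes, decode

def sum_by_group(group_codes, measures, decode):
    """Aggregate entirely in RAM over the encoded column (no disk I/O)."""
    totals = defaultdict(float)
    for code, amount in zip(group_codes, measures):
        totals[code] += amount
    return {decode[code]: total for code, total in totals.items()}

if __name__ == "__main__":
    # Hypothetical sales table: the region column compresses well because
    # it has few distinct values.
    regions = ["north", "south", "north", "east", "south", "north"]
    revenue = [120.0, 80.0, 200.0, 50.0, 70.0, 30.0]
    codes, decode = dictionary_encode(regions)
    print(sum_by_group(codes, revenue, decode))
    # {'north': 350.0, 'south': 150.0, 'east': 50.0}
```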

Yahoo!'s cloud computing system is centered on Hadoop. Zhou Yuping walked through Yahoo's massive data processing solution in terms of data acquisition, data storage and processing (HDFS), and data serving. For data acquisition, Yahoo built Data Highway to collect data in real time from the hundreds of thousands of machines in Yahoo's global data centers. It has two paths. The main path cleans high-throughput data through various filters and lands it on the Hadoop platform, but the latency of this path is not very low; to meet real-time requirements, a bypass system can import data sources directly into Hadoop within seconds. Yahoo's data storage and processing are based on Hadoop, while data serving is handled by a large set of service systems tailored to different business logic. More than half of the data processing is done with the Pig data-flow engine on Hadoop.
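As an illustration of the processing model this pipeline feeds, here is a minimal local sketch of MapReduce, the paradigm that Hadoop executes at scale and that Pig compiles down to. The log format and field names are hypothetical, and the single-process "shuffle" only mimics what Hadoop does across a cluster.

```python
# A minimal local sketch of the MapReduce model behind Hadoop and Pig.
# Hypothetical page-view log: user \t page \t HTTP status.

from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (page, 1) for every record that passes the cleaning filter."""
    user, page, status = line.split("\t")
    if status == "200":          # cleaning step: keep only successful views
        yield page, 1

def reducer(page, counts):
    """Sum the per-page counts produced by the map phase."""
    yield page, sum(counts)

def run_job(lines):
    # Map phase
    mapped = [kv for line in lines for kv in mapper(line)]
    # Shuffle phase: group by key, as Hadoop does between map and reduce
    mapped.sort(key=itemgetter(0))
    # Reduce phase
    results = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for page, total in reducer(key, (count for _, count in group)):
            results[page] = total
    return results

if __name__ == "__main__":
    log = [
        "alice\t/home\t200",
        "bob\t/search\t200",
        "carol\t/home\t500",
        "dave\t/home\t200",
    ]
    print(run_job(log))   # {'/home': 2, '/search': 1}
```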

Baidu's William Zhang said that for cloud computing at Internet scale, web search is built on indexes, and updating that data quickly, in real time, requires optimization. For example, data is routed to fast-update or slow-update systems according to how frequently it changes, and is placed in the southern or northern server rooms according to the geography of the users who access it and its importance; in other words, the strategy follows how the data is used. As for machine learning, running high-complexity computations over in-memory data can take a long time, which is not feasible in Baidu's environment: whether it is determining what a user needs, or learning from user behavior what content and what kind of advertising to recommend, all of this demands very tight timeliness and extremely large-scale machine learning.
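A minimal sketch of the frequency-based routing idea Zhang describes might look as follows. The thresholds, tier names, and document attributes are hypothetical, not Baidu's actual system.

```python
# A minimal sketch of routing index updates by change frequency: "hot"
# documents go to a fast-update tier, "cold" ones to a slow batch tier.
# All names and thresholds below are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Document:
    url: str
    updates_per_day: float   # observed change frequency
    importance: float        # e.g., a link-based score in [0, 1]

FAST_TIER_THRESHOLD = 10.0   # assumed cutoff: >10 changes/day is "hot"

def route_update(doc: Document) -> str:
    """Pick an index tier for a document update."""
    if doc.updates_per_day > FAST_TIER_THRESHOLD or doc.importance > 0.9:
        return "realtime-index"    # re-indexed within seconds
    return "batch-index"           # folded in by periodic batch rebuilds

if __name__ == "__main__":
    news = Document("news.example.com/today", updates_per_day=48, importance=0.7)
    archive = Document("example.com/2005/post", updates_per_day=0.01, importance=0.2)
    print(route_update(news))      # realtime-index
    print(route_update(archive))   # batch-index
```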
