Background: With the surge in the volume of collected data, deduplication will undoubtedly become one of the problems faced by researchers. Deduplicating coarse-grained redundant data offers significant advantages in reducing storage consumption and network bandwidth and in improving system scalability. Conventional methods of deleting duplicate data, such as hash comparison and binary differential incremental techniques, lead to several bottlenecks when processing large-scale data. Moreover, the traditional Simhash similarity method gives little consideration to the natural similarity of text in specific fields and cannot efficiently process large-scale text data in parallel. This paper examines several of the most important patents in the area of duplicate data detection and then focuses on large-scale data deduplication based on MapReduce and HDFS.
Methods: We propose a duplicate data detection approach based on MapReduce and HDFS that uses the Simhash similarity computing algorithm and the Shared Nearest Neighbor (SNN) algorithm, and we explain our distributed duplicate detection workflow. The key technical advantages of the invention include generating a checksum for each processed record and comparing the generated checksums to detect duplicate records. The approach produces fingerprints of short texts with the Simhash similarity algorithm and clusters the resulting fingerprints with the SNN algorithm. The whole parallel process is implemented using the MapReduce programming model.
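As an illustration of the fingerprinting and parallelization steps described above, the following is a minimal Python sketch: simhash() computes a fingerprint per record, and a toy mapper/reducer pair (plain Python stand-ins for Hadoop tasks) groups fingerprints by prefix and flags candidate duplicates. The whitespace tokenization, MD5 token hashing, 64-bit fingerprint width, 16-bit grouping prefix, and Hamming threshold of 3 are assumptions for the example, and the simple threshold comparison stands in for the SNN clustering of the actual approach.

```python
# Illustrative sketch only: tokenization, hash function, bit widths, and
# the Hamming threshold are assumptions, not the patented implementation.
import hashlib

BITS = 64

def simhash(text):
    """Compute a 64-bit Simhash fingerprint over whitespace tokens."""
    weights = [0] * BITS
    for token in text.split():
        # Hash each token to a 64-bit integer (MD5 truncated here).
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << BITS) - 1)
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bits with a positive weighted sum become 1 in the fingerprint.
    return sum(1 << i for i in range(BITS) if weights[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def mapper(doc_id, text):
    """Map task: fingerprint one record, keyed by a 16-bit prefix so
    fingerprints likely to be similar meet on the same reducer."""
    fp = simhash(text)
    yield fp >> (BITS - 16), (doc_id, fp)

def reducer(prefix, records):
    """Reduce task: compare fingerprints within a group; a fixed Hamming
    threshold stands in for the SNN clustering of the actual approach."""
    records = list(records)
    for i, (id_a, fp_a) in enumerate(records):
        for id_b, fp_b in records[i + 1:]:
            if hamming(fp_a, fp_b) <= 3:
                yield (id_a, id_b)  # candidate duplicate pair
```

In this sketch, two records are treated as near duplicates when hamming(simhash(a), simhash(b)) <= 3; keying by a fingerprint prefix merely confines the pairwise comparisons to small groups so the map and reduce tasks can run in parallel.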
Results: The experimental results show that the proposed approach obtains MapReduce job schedules with significantly shorter execution times, making it suitable for processing large-scale datasets in real applications, and that it achieves better overall performance.
Conclusion: In this patent, we propose a duplicate data detection approach based on MapReduce and HDFS that uses the Simhash similarity computing algorithm and the SNN algorithm. The results show that the new approach, implemented on MapReduce, is suitable for document similarity calculation over large-scale datasets: it greatly reduces time overhead, achieves higher precision and recall, and provides a reference for solving similar problems on large-scale data. The invention also applies to large-scale duplicate data detection and offers a sound solution to large-scale data processing. In the future, we plan to design and implement a scheduler for MapReduce jobs and a new similarity algorithm, with a primary focus on large-scale duplicate data detection.