High performance data profiler for big data

Publication date: 29/08/2019
Source: WIPO "hive"
A method for profiling a dataset includes: querying, by a data profiler executed on a distributed computing system, a metadata storage to obtain table information; allocating, by the data profiler, system resources based on the obtained table information; profiling, by the data profiler, the dataset to obtain profiling results, wherein profiling the dataset includes shuffling and repartitioning data blocks of the dataset with respect to a plurality of nodes of the distributed computing system, and computing aggregates based on the shuffled and repartitioned data blocks; and outputting, by the data profiler, the profiling results.
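
The claimed steps can be sketched in a few lines of plain Python. This is only an illustrative single-process simulation, not the patented implementation: the in-memory `METASTORE`, the `TableInfo` fields, the `rows_per_partition` sizing rule, the CRC32 hash partitioning, and the choice of count/sum/min/max as aggregates are all assumptions made for the example, and "nodes" are modeled as plain lists.

```python
from dataclasses import dataclass
from zlib import crc32


@dataclass
class TableInfo:
    """Hypothetical table information returned by the metadata storage."""
    name: str
    columns: list
    row_count: int


# Hypothetical in-memory stand-in for a metadata storage.
METASTORE = {"orders": TableInfo("orders", ["region", "amount"], 6)}

# Sample dataset, already split into data blocks (one list per block).
SAMPLE_BLOCKS = [
    [("east", 10.0), ("west", 5.0)],
    [("east", 20.0), ("north", 7.0)],
    [("west", 15.0), ("north", 3.0)],
]


def query_metadata(table):
    """Step 1: query the metadata storage to obtain table information."""
    return METASTORE[table]


def allocate_partitions(info, rows_per_partition=2, num_nodes=3):
    """Step 2: size the job from the table information (assumed rule)."""
    needed = -(-info.row_count // rows_per_partition)  # ceiling division
    return max(1, min(num_nodes, needed))


def shuffle_and_repartition(blocks, key_index, num_partitions):
    """Step 3a: route every row to a partition chosen by hashing its key,
    so all rows sharing a key land on the same simulated node."""
    partitions = [[] for _ in range(num_partitions)]
    for block in blocks:
        for row in block:
            key = str(row[key_index]).encode()
            partitions[crc32(key) % num_partitions].append(row)
    return partitions


def aggregate_partition(rows, key_index, value_index):
    """Step 3b: compute per-key aggregates within one partition."""
    stats = {}
    for row in rows:
        key, value = row[key_index], row[value_index]
        entry = stats.get(key)
        if entry is None:
            stats[key] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            entry["count"] += 1
            entry["sum"] += value
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)
    return stats


def profile(table, blocks):
    """Run the four steps and return (output) the profiling results."""
    info = query_metadata(table)
    num_partitions = allocate_partitions(info)
    partitions = shuffle_and_repartition(blocks, 0, num_partitions)
    results = {}
    for partition in partitions:
        # Keys are disjoint across partitions, so a plain update is safe.
        results.update(aggregate_partition(partition, 0, 1))
    return results


if __name__ == "__main__":
    for key, stats in sorted(profile("orders", SAMPLE_BLOCKS).items()):
        print(key, stats)
```

Because the shuffle places every row with a given key in the same partition, each partition can be aggregated independently and the partial results merged without reconciling duplicates. In a real distributed engine that per-partition work would run in parallel on separate nodes.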