Handling Criteria-Driven Filtering & Sampling Across Distributed Data Partitions
This paper presents a five-layer architecture for obtaining equitable representation across the fragmented partitions of large datasets. The architecture applies a set of customizable criteria to group similar datasets into well-balanced, unbiased sample sets while retaining complete and accurate metadata for all file boundary elements. Its main features are: a multi-level criteria engine that processes accepted and rejected streams from a variety of sources; a stratified sampling process that effectively eliminates partition skew; and a quality assurance mechanism that efficiently captures and reports performance-related metrics. In benchmark partitioning results, the balance ratio improved from approximately 1.75 to 1.12, volume anomaly fell by 87% (from ±25% to ±3.2%), and no schema drift occurred, at an overhead of less than 5% of the overall dataset processing cost. The method makes the processing of large datasets more reliable while minimising bias and supporting efficient audit trails. Future enhancements to the architecture will focus on distributed execution and on streaming integration with enterprise data systems.
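As a rough illustration of the ideas summarised above, the following minimal Python sketch filters each partition through a list of criteria, keeps separate accepted and rejected streams, and then draws an equal-sized random sample from every partition. It is not the authors' implementation: the function names, the record layout, the example criterion, and the definition of the balance ratio as largest partition count divided by smallest are all assumptions made for illustration.

```python
import random

def split_streams(records, criteria):
    """Route each record to the accepted or rejected stream.

    A record is accepted only if every criterion (a predicate) returns True.
    """
    accepted, rejected = [], []
    for record in records:
        if all(rule(record) for rule in criteria):
            accepted.append(record)
        else:
            rejected.append(record)
    return accepted, rejected

def stratified_sample(partitions, criteria, per_partition, seed=0):
    """Treat each partition as a stratum and sample the same quota from each.

    Returns the sampled records per partition and the rejected count per
    partition, so the rejected stream stays visible for auditing.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample, rejected_counts = {}, {}
    for name, records in partitions.items():
        accepted, rejected = split_streams(records, criteria)
        quota = min(per_partition, len(accepted))
        sample[name] = rng.sample(accepted, quota)
        rejected_counts[name] = len(rejected)
    return sample, rejected_counts

def balance_ratio(counts):
    """Assumed definition: largest non-empty count divided by smallest."""
    sizes = [c for c in counts if c > 0]
    return max(sizes) / min(sizes)

# Hypothetical skewed input: two partitions of very different sizes.
partitions = {
    "p0": [{"id": i, "value": i % 10} for i in range(700)],
    "p1": [{"id": i, "value": i % 10} for i in range(400)],
}
criteria = [lambda r: r["value"] != 0]  # hypothetical rule: drop zero values

accepted_counts = [
    len(split_streams(records, criteria)[0]) for records in partitions.values()
]
sample, rejected = stratified_sample(partitions, criteria, per_partition=300)

print(balance_ratio(accepted_counts))                       # 1.75
print(balance_ratio([len(v) for v in sample.values()]))     # 1.0
print(rejected)                                             # {'p0': 70, 'p1': 40}
```

With an equal quota per partition the sketch reaches a ratio of 1.0, which is only possible because every partition holds more accepted records than the quota. A proportional or capped quota would leave some residual imbalance.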
References
“Balanced Sampling”, Laurent Costa, Thomas Merly-Alpa, Département des méthodes statistiques, Version no 1, diffusée le 21 juin 2017.
“A balanced sampling approach for multi-way stratification designs for small area estimation”, Piero Demetrio Falorsi and Paolo Righi, Survey Methodology, December 2008, https://www.istat.it/en/files/2016/10/Falorsi-engSURVEY_METH.pdf.
“Data Merging: Process, Challenges, and Best Practices for Combining Data from Multiple Sources”, Ehsan Elahi, November 15, 2021, https://dataladder.com/merging-data-from-multiple-sources/.
“The Speed of Now: Examples of Real-Time Processing in Action”, Wojciech Marusarz, April 17, 2023, https://nexocode.com/blog/posts/examples-of-real-time-processing/.
“Step By Step Guide: Proportional Sampling For Data Science With Python!”, Bharath K, Oct 22, 2020, https://towardsdatascience.com/step-by-step-guide-proportional-sampling-for-data-science-with-python-8b2871159ae6/.
“Rebalance Your Portfolio? You are a Market Timer and Here’s What to Consider”, Andrew Miller, March 23rd, 2017, https://alphaarchitect.com/do-you-rebalance-your-portfolio-you-are-a-market-timer/.
“R*-Grove: Balanced Spatial Partitioning for Large-Scale Datasets”, Tin Vu, Ahmed Eldawy, 28 August 2020, https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2020.00028/full.
“Load balancing for partition-based similarity search”, Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang, 03 July 2014, https://doi.org/10.1145/2600428.2609624.
“Incremental Partitioning for Efficient Spatial Data Analytics”, Tin Vu, Ahmed Eldawy, Vagelis Hristidis, Vassilis Tsotras, 2022, https://doi.org/10.14778/3494124.3494150.
“Effective Spatial Data Partitioning for Scalable Query Processing”, Ablimit Aji, Hoang Vo, Fusheng Wang, 3 Sep 2015, https://arxiv.org/pdf/1509.00910.
“Distributed Partitioning and Processing of Large Spatial Datasets”, Ayman I. Zeidan, 2022, https://academicworks.cuny.edu/gc_etds/4640/.
“Benchmarking data partitioning techniques in HDFS for big real spatial data”, Nikolaos Niopas, July 10, 2019, https://staff.fnwi.uva.nl/a.s.z.belloum/MSctheses/MScthesis_Nikos_Niopas.pdf.
“Efficient spatial data partitioning for distributed kNN joins”, Ayman Zeidan, H. Vo, 2 June 2022, https://www.semanticscholar.org/paper/Efficient-spatial-data-partitioning-for-distributed-Zeidan-Vo/549f632ea0f0d800116cc4760571d0ff7e9eaeb5.
“A Performance Study of Big Spatial Data Systems”, Md Mahbub Alam, Suprio Ray, Virendra C. Bhavsar, November 6, 2018, https://doi.org/10.1145/3282834.3282841.

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in our journal are licensed under CC-BY 4.0, which permits authors to retain copyright of their work. This license allows for unrestricted use, sharing, and reproduction of the articles, provided that proper credit is given to the original authors and the source.