The University of Sheffield

Outlier Set Two-step Method (OSTI)

Version 3 2025-07-01, 22:46
Version 2 2025-03-25, 15:58
Version 1 2025-01-24, 11:43
dataset
posted on 2025-07-01, 22:46 authored by Amal Sarfraz, Abigail Birnbaum, Flannery Dolan, Jonathan Lamontagne, Lyudmila Mihaylova, Charles Rouge
These files are supplements to the paper titled 'A Robust Two-step Method for Detection of Outlier Sets'.

This paper identifies and addresses the need for a robust method that identifies sets of points that collectively deviate from typical patterns in a dataset, which it calls "outlier sets", while excluding individual points from detection. This new methodology, Outlier Set Two-step Identification (OSTI), employs a two-step approach to detect and label these outlier sets. First, it uses Gaussian Mixture Models for probabilistic clustering, identifying candidate outlier sets as clusters whose weights fall below a predetermined threshold. Second, OSTI measures the inter-cluster Mahalanobis distance between each candidate outlier set's centroid and the overall dataset mean. OSTI then tests the null hypothesis that this distance does not significantly differ from its theoretical chi-square distribution, enabling the formal detection of outlier sets. We test OSTI systematically on 8,000 synthetic 2D datasets across various inlier configurations and thousands of possible outlier set characteristics. Results show that OSTI robustly and consistently detects outlier sets, with an average F1 score of 0.92 and an average purity (the degree to which the outlier sets identified correspond to those generated synthetically, i.e., our ground truth) of 98.58%. We also compare OSTI with state-of-the-art outlier detection methods to illuminate how OSTI fills a gap as a tool for the exclusive detection of outlier sets.
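The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the deposited files and the paper are the reference for that) and it rests on several assumptions: the mixture weights and centroids are supplied by hand as stand-ins for the output of a Gaussian Mixture Model fit, the Mahalanobis distance uses the sample covariance of the whole dataset, the test is an upper-tail chi-square test with 2 degrees of freedom (matching 2D data), and the values `weight_threshold=0.1` and `alpha=0.05` are hypothetical. All function names are illustrative.

```python
import math


def mean_and_cov_2d(points):
    """Return the sample mean and the 2x2 sample covariance (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def squared_mahalanobis_2d(point, mean, cov):
    """Squared Mahalanobis distance of a 2D point from a mean, given cov."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Closed-form inverse of [[sxx, sxy], [sxy, syy]] applied to (dx, dy).
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def chi2_sf_2dof(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def flag_outlier_sets(points, weights, centroids,
                      weight_threshold=0.1, alpha=0.05):
    """Return indices of clusters flagged as outlier sets.

    Step 1: keep clusters whose mixture weight is below weight_threshold.
    Step 2: flag a candidate if the chi-square p-value of its centroid's
            squared Mahalanobis distance from the dataset mean is below alpha.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    mean, cov = mean_and_cov_2d(points)
    flagged = []
    for k, (w, c) in enumerate(zip(weights, centroids)):
        if w >= weight_threshold:
            continue  # heavy cluster: not a candidate outlier set
        d2 = squared_mahalanobis_2d(c, mean, cov)
        if chi2_sf_2dof(d2) < alpha:
            flagged.append(k)
    return flagged


# Toy data: an 11x11 inlier grid plus a small, distant group of 5 points.
inliers = [(float(i), float(j)) for i in range(-5, 6) for j in range(-5, 6)]
outliers = [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0), (21.0, 21.0), (20.5, 20.5)]
points = inliers + outliers

# Hand-supplied stand-ins for fitted mixture weights and cluster centroids.
weights = [len(inliers) / len(points), len(outliers) / len(points)]
centroids = [(0.0, 0.0), (20.5, 20.5)]

print(flag_outlier_sets(points, weights, centroids))
```

In this toy example only the small, distant group passes both steps. In practice the weights and centroids would come from a fitted mixture model rather than being entered by hand, and the thresholds would follow the paper.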

Ethics

  • There is no personal data or any that requires ethical approval

Policy

  • The data complies with the institution and funders' policies on access and sharing

Sharing and access restrictions

  • The uploaded data can be shared openly

Data description

  • The file formats are open or commonly used

Methodology, headings and units

  • There is a file including methodology, headings and units, such as a readme.txt

Responsibility

  • The depositor is responsible for the content and sharing of the attached files

    Department of Civil and Structural Engineering
