A review and recommendations on variable selection methods in regression models for binary data

Paper title

A review and recommendations on variable selection methods in regression models for binary data

Authors

Souvik Bag, Kapil Gupta, Soudeep Deb

Abstract

The selection of essential variables in logistic regression is vital because of its extensive use in medical studies, finance, economics and related fields. In this paper, we explore four main types (test-based, penalty-based, screening-based, and tree-based) of frequentist variable selection methods in the logistic regression setup. The primary objective of this work is to give practitioners a comprehensive overview of the existing literature. The underlying assumptions and theory, along with the specifics of their implementations, are detailed as well. Next, we conduct a thorough simulation study to explore the performance of fifteen different methods in terms of variable selection, estimation of coefficients, prediction accuracy, and time complexity under various settings. We take low-, moderate- and high-dimensional setups and consider different correlation structures for the covariates. A real-life application, using high-dimensional gene expression data, is also included in this study to further understand the efficacy and consistency of the methods. Finally, based on our findings in the simulated data and in the real data, we provide recommendations for practitioners on the choice of variable selection methods under various contexts.
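
As one illustration of the penalty-based family named in the abstract, the sketch below fits an L1-penalized (lasso) logistic regression by proximal gradient descent on simulated data and keeps the covariates whose coefficients are nonzero. It is not taken from the paper: the function names (`lasso_logistic`, `soft_threshold`), the simulated design, the penalty level `lam`, the step size, and the omission of an intercept are all assumptions chosen to keep the example short.

```python
import math
import random


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_logistic(X, y, lam, step=0.5, n_iter=500):
    """Minimise mean logistic loss + lam * ||beta||_1 by proximal gradient.

    Assumption: no intercept term, and a fixed step size that is small
    enough for roughly standardised covariates.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        # Gradient of the mean negative log-likelihood at the current beta.
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            resid = sigmoid(sum(b * v for b, v in zip(beta, xi))) - yi
            for j in range(p):
                grad[j] += resid * xi[j] / n
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad)]
    return beta


# Simulated data (hypothetical design): 8 independent standard normal
# covariates, of which only the first two affect the binary response.
random.seed(0)
n, p = 300, 8
true_beta = [2.0, -2.0] + [0.0] * (p - 2)
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [1 if random.random() < sigmoid(sum(b * v for b, v in zip(true_beta, xi))) else 0
     for xi in X]

beta = lasso_logistic(X, y, lam=0.05)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected covariate indices:", selected)
```

The penalty level controls how many covariates survive: a larger `lam` zeroes out more coefficients. In practice it would usually be tuned, for example by cross-validation, rather than fixed in advance as it is here.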
