Principal component analysis of binary genomics data.

Yipeng Song, Johan A Westerhuis, Nanne Aben, Magali Michaut, Lodewyk F A Wessels, Age K Smilde

Abstract

MOTIVATION

Genome-wide measurements of genetic and epigenetic alterations are generating more and more high-dimensional binary data. The special mathematical characteristics of binary data make the direct use of the classical principal component analysis (PCA) model to explore low-dimensional structures less obvious. Although there are several PCA alternatives for binary data in the psychometric, data analysis and machine learning literature, they are not well known to the bioinformatics community.

RESULTS

In this article, we introduce the motivation and rationale of some parametric and nonparametric versions of PCA specifically geared for binary data. Using both realistic simulations of binary data and mutation, CNA and methylation data from the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000), we explored the methods' performance with respect to finding the correct number of components, overfitting, recovering the correct low-dimensional structure, variable importance, etc. The results show that if a low-dimensional structure exists in the data, most of the methods can find it. When the data are assumed to arise from an underlying probabilistic generating process, we recommend the parametric logistic PCA model; when such an assumption is not valid and the data are considered as given, we recommend the nonparametric Gifi model.

AVAILABILITY

The code to reproduce the results in this article is available at the homepage of the Biosystems Data Analysis group (www.bdagroup.nl).

More about this publication

Briefings in bioinformatics

Volume 20
Issue nr. 1
Pages 317-329
Publication date 18-01-2019

Full text links

Publisher website (DOI) 10.1093/bib/bbx119
Europe PubMed Central 30657888
Pubmed 30657888
