TY - JOUR
T1 - BioKIT
T2 - A versatile toolkit for processing and analyzing diverse types of sequence data
AU - Steenwyk, Jacob L.
AU - Buida, Thomas J.
AU - Gonçalves, Carla
AU - Goltz, Dayna C.
AU - Morales, Grace
AU - Mead, Matthew E.
AU - Labella, Abigail L.
AU - Chavez, Christina M.
AU - Schmitz, Jonathan E.
AU - Hadjifrangiskou, Maria
AU - Li, Yuanning
AU - Rokas, Antonis
N1 - © 2022 The Author(s). Published by Oxford University Press on behalf of Genetics Society of America. All rights reserved.
PY - 2022/7
Y1 - 2022/7
N2 - Bioinformatic analysis (such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, file format conversion, and processing and analysis) is integrated into diverse disciplines in the biological sciences. Several command-line pieces of software have been developed to conduct some of these individual analyses, but unified toolkits that conduct all these analyses are lacking. To address this gap, we introduce BioKIT, a versatile command-line toolkit that has, upon publication, 42 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we conducted a comprehensive examination of relative synonymous codon usage across 171 fungal genomes that use alternative genetic codes, showed that the novel metric of gene-wise relative synonymous codon usage can accurately estimate gene-wise codon optimization, evaluated the quality and characteristics of 901 eukaryotic genome assemblies, and calculated alignment summary statistics for 10 phylogenomic data matrices. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPI (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).
KW - bioinformatics
KW - codon
KW - gene-wise relative synonymous codon usage
KW - genetic code
KW - genome assembly quality
KW - multiple sequence alignment
UR - http://www.scopus.com/inward/record.url?scp=85134083645&partnerID=8YFLogxK
U2 - 10.1093/genetics/iyac079
DO - 10.1093/genetics/iyac079
M3 - Article
C2 - 35536198
AN - SCOPUS:85134083645
SN - 0016-6731
VL - 221
JO - Genetics
JF - Genetics
IS - 3
M1 - iyac079
ER -