File Types¶
Chewbacca uses several filetypes throughout its functions. Getting acquainted with these filetypes will save you time and a lot of headaches.
Fasta File¶
Common Extensions: .fa, .fasta, .FASTA
- Fasta files are commonly used in the biological sciences to store genetic sequences together with their metadata (such as sequence names and descriptions).
- Read more here: https://en.wikipedia.org/wiki/FASTA_format
FastQ File¶
Common Extensions: .fq, .fastq, .FASTQ
- FastQ files are very similar to fasta files, but also include a quality score for each base of the genetic sequences.
- Read more here: https://en.wikipedia.org/wiki/FASTQ_format
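As a rough illustration of how the quality scores sit alongside each sequence, here is a minimal sketch of a FastQ reader. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; both are assumptions about your input files, and `read_fastq` is a hypothetical helper, not a Chewbacca function.

```python
def read_fastq(lines):
    """Return a list of (name, sequence, quality_scores) tuples.

    Assumes four lines per record (header, sequence, '+', qualities)
    and Phred+33 encoded quality characters.
    """
    lines = [line.rstrip("\n") for line in lines]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, qualities = lines[i:i + 4]
        # Phred+33: the score is the character's ASCII code minus 33.
        scores = [ord(char) - 33 for char in qualities]
        records.append((header[1:], sequence, scores))  # drop the leading '@'
    return records


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
records = read_fastq(example)
print(records[0])
```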
Groups File¶
Common Extensions: .groups
Groups files are used by Chewbacca to keep track of the identities and counts of sequences in groups/clusters/OTUs throughout the analytical process. Ultimately, this data is used to generate an OTU Table. These files are generated and updated whenever dereplication or clustering occurs.
A Groups file consists of one or more lines in the following format:
GROUPNAME <tab> SequenceName <space> SequenceName <space> ...
Example
Rodent_gutID111 Rodent_gutID111 Rodent_gutID112 Rodent_gutID113
Rodent_noseID115 Rodent_stomachID115 Rodent_gutID117
Rodent_gutID119 Rodent_gutID119
Notes:
- The GROUPNAME for a group/cluster will likely be the name of a sequence within that group/cluster.
This means that one sequence name will likely appear twice on a line (once as a GROUPNAME, and once as a SequenceName). This is not an error, but the intended format.
See the “naming conventions” section for more info on Chewbacca sequence naming standards.
In any given .groups file, a sequence name should be listed on exactly ONE line (that is, in one group/cluster/OTU).
The sequence data itself is stored separately.
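The format and the one-line-per-sequence rule above can be sketched in a few lines of Python. This is only an illustration under the stated format (a tab after the GROUPNAME, spaces between sequence names); `parse_groups` and `duplicated_sequences` are hypothetical helpers, not part of Chewbacca.

```python
from collections import Counter


def parse_groups(lines):
    """Parse .groups lines into {group_name: [sequence_name, ...]}."""
    groups = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # GROUPNAME is separated from its members by a tab;
        # the members themselves are separated by spaces.
        group_name, _, members = line.partition("\t")
        groups[group_name] = members.split()
    return groups


def duplicated_sequences(groups):
    """Return the sequence names that are listed in more than one group."""
    counts = Counter(
        name for members in groups.values() for name in set(members)
    )
    return sorted(name for name, n in counts.items() if n > 1)


example = [
    "Rodent_gutID111\tRodent_gutID111 Rodent_gutID112 Rodent_gutID113\n",
    "Rodent_noseID115\tRodent_stomachID115 Rodent_gutID117\n",
    "Rodent_gutID119\tRodent_gutID119\n",
]
groups = parse_groups(example)
print(groups["Rodent_gutID111"])
print(duplicated_sequences(groups))  # an empty list means the rule holds
```

Note that the GROUPNAME is kept only as the dictionary key, so a name that appears both as GROUPNAME and as SequenceName is counted once as a member.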
Samples File¶
Common Extensions: .samples
Samples files are used by Chewbacca to map sequence names to the names of their respective samples. This file is generally written once, early in the analytical process, at the time of sequence renaming. Its primary purposes are annotation and construction of an OTU table at the end of the analysis.
A Samples file consists of one or more lines in the following format:
SequenceName <tab> SampleName
Example
Rodent_gutID111 GUT_SAMPLE_21
Rodent_gutID112 GUT_SAMPLE_21
Rodent_gutID113 GUT_SAMPLE_22
Rodent_noseID115 NOSE_SAMPLE_1
Rodent_stomachID115 STOMACH_SAMPLE_2
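A Samples file reads naturally into a dictionary keyed by sequence name. The sketch below assumes exactly one tab per line, as in the format above; `parse_samples` is a hypothetical helper name used for illustration only.

```python
from collections import Counter


def parse_samples(lines):
    """Parse .samples lines into {sequence_name: sample_name}."""
    samples = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sequence_name, sample_name = line.split("\t")
        samples[sequence_name] = sample_name
    return samples


example = [
    "Rodent_gutID111\tGUT_SAMPLE_21\n",
    "Rodent_gutID112\tGUT_SAMPLE_21\n",
    "Rodent_gutID113\tGUT_SAMPLE_22\n",
    "Rodent_noseID115\tNOSE_SAMPLE_1\n",
    "Rodent_stomachID115\tSTOMACH_SAMPLE_2\n",
]
samples = parse_samples(example)

# Number of sequences recorded for each sample.
per_sample = Counter(samples.values())
print(per_sample["GUT_SAMPLE_21"])
```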
Barcodes file¶
Common Extensions: .barcodes, .txt
Barcodes files map the nucleotide prefixes used for multiplexing to the samples they identify.
A Barcodes file consists of one or more lines in the following format:
<Sample_name> <tab> <barcode_sequence>
Example
GUT_SAMPLE_21 AGACGC
GUT_SAMPLE_22 AGTAGT
NOSE_SAMPLE_1 ACTAGG
STOMACH_SAMPLE_2 AGTACG
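To show how such a mapping could be used, the following sketch assigns a read to a sample by exact prefix match. It is a simplified assumption about demultiplexing (no mismatches allowed, barcode at the very start of the read) and does not describe how Chewbacca or its underlying tools actually demultiplex; `parse_barcodes`, `assign_sample`, and the example read are hypothetical.

```python
def parse_barcodes(lines):
    """Parse barcodes lines into {barcode_sequence: sample_name}."""
    barcodes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample_name, barcode = line.split("\t")
        barcodes[barcode] = sample_name
    return barcodes


def assign_sample(read, barcodes):
    """Return the sample whose barcode is an exact prefix of the read, or None."""
    for barcode, sample_name in barcodes.items():
        if read.startswith(barcode):
            return sample_name
    return None


example = [
    "GUT_SAMPLE_21\tAGACGC\n",
    "GUT_SAMPLE_22\tAGTAGT\n",
    "NOSE_SAMPLE_1\tACTAGG\n",
    "STOMACH_SAMPLE_2\tAGTACG\n",
]
barcodes = parse_barcodes(example)
print(assign_sample("AGTAGTCCGTA", barcodes))  # made-up read for illustration
```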
Adapters file¶
Common Extensions: .adapters, .txt, .fa, .fasta
Adapters files are fasta files that contain the forward-read pyrosequencing adapters. An Adapters file should be paired with an RC Adapters file, and should contain the same number of entries as its paired RC Adapters file.
Example
>adapter1
GGWACWGGWTGAACWGTWTAYCCYCC
>adapter2
TANACYTCNGGRTGNCCRAARAAYCA
RC Adapters file¶
Common Extensions: .adapters, .txt, .fa, .fasta
RC Adapters files are fasta files that contain the reverse-read pyrosequencing adapters (the reverse complements of the forward-read adapters). An RC Adapters file should be paired with an Adapters file, and should contain the same number of entries as its paired Adapters file.
Example
>adapter1_RC
TGRTTYTTYGGNCAYCCNGARGTNTA
>adapter2_RC
GGRGGRTAWACWGTTCAWCCWGTWCC
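Because the adapters contain IUPAC ambiguity codes (W, Y, R, N, ...), a plain A/C/G/T complement is not enough to produce reverse complements. The sketch below uses the standard IUPAC complement pairs; `reverse_complement` is a hypothetical helper, not a Chewbacca function.

```python
# Standard IUPAC nucleotide codes and their complements
# (e.g. R = A/G pairs with Y = C/T; W and S are their own complements).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(sequence):
    """Return the reverse complement of a sequence, honoring IUPAC codes."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("GGWACWGGWTGAACWGTWTAYCCYCC"))
print(reverse_complement("TANACYTCNGGRTGNCCRAARAAYCA"))
```

Applied to the example Adapters file, the first call yields the sequence listed under adapter2_RC and the second yields the sequence listed under adapter1_RC.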
Tax file¶
Common Extensions: .tax, .out, .txt
Tax files are condensed versions of blast6 output files, detailing the match between a query sequence and a possible identification. These files are generated by the :ref:`id_OTU` command and ingested by the :ref:`annotate_OTU` command.
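Based on the blast6 output format, a Tax file consists of one or more lines in the following format. In the example below, each line also carries a trailing taxonomy string whose ranks are separated by semicolons: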
<query> <tab> <target> <tab> <id> <tab> <alnlen> <tab> <qcov>
Example
BALI4606_0_ID1264_2 GBMAA1117-14 90.6 265 84.7 Animalia;Porifera;Demospongiae;Haplosclerida;Phloeodictyidae;;Calyx;Calyx podatypa
BALI4462_0_ID921_1 GBCI5234-15 98.8 258 82.4 Animalia;Cnidaria;Anthozoa;Alcyonacea;Xeniidae;;Xenia;Xenia sp. 1 CSM2014
BALI4673_0_ID837_1 KHA237-14 96.1 279 100.0 Animalia;Cnidaria;Anthozoa;Actiniaria;;;;
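A Tax line can be turned into a typed record as sketched below. This assumes the fields are tab-separated and that the trailing taxonomy string seen in the example lines is a sixth tab-separated field; `TaxHit` and `parse_tax_line` are hypothetical names used for illustration.

```python
from typing import List, NamedTuple


class TaxHit(NamedTuple):
    query: str
    target: str
    identity: float   # the <id> field, percent identity
    alnlen: int       # alignment length
    qcov: float       # query coverage
    taxonomy: List[str]


def parse_tax_line(line):
    """Parse one Tax line; the taxonomy field is optional."""
    fields = line.rstrip("\n").split("\t")
    query, target, identity, alnlen, qcov = fields[:5]
    # Empty ranks (';;') are kept as empty strings so positions stay aligned.
    taxonomy = fields[5].split(";") if len(fields) > 5 else []
    return TaxHit(query, target, float(identity), int(alnlen), float(qcov), taxonomy)


line = "BALI4673_0_ID837_1\tKHA237-14\t96.1\t279\t100.0\tAnimalia;Cnidaria;Anthozoa;Actiniaria;;;;"
hit = parse_tax_line(line)
print(hit.target, hit.identity, hit.taxonomy[:4])
```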
OTU Table¶
Common Extensions: .txt
OTU tables are commonly used in biological surveys to list OTU abundances in different samples.
OTU tables consist of a header line in the following format:
OTU <tab> <Samplename1> <tab> <Samplename2> <tab> <Samplename3> ...
followed by one or more lines (one per OTU) in the following format:
<OTU_name> <tab> <Abundance at Samplename1> <tab> <Abundance at Samplename2> <tab> <Abundance at Samplename3>
Example
OTU Rat_Gut_Sample1 Rat_Nose_Sample1
Rat_Gut_ID3 3 0
Rat_Gut_ID25 1 1
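To make the link between Groups files, Samples files, and the OTU table concrete, here is a minimal sketch that tabulates abundances from the two mappings. It assumes each listed sequence name counts as one observation, which may differ from how Chewbacca tracks abundance; `build_otu_table` and the sequence names in the example data are hypothetical.

```python
from collections import Counter


def build_otu_table(groups, samples):
    """Build tab-separated OTU table text.

    groups:  {otu_name: [sequence_name, ...]}   (as in a Groups file)
    samples: {sequence_name: sample_name}       (as in a Samples file)
    """
    sample_names = sorted(set(samples.values()))
    rows = [["OTU"] + sample_names]
    for otu_name, members in groups.items():
        # Count how many member sequences came from each sample.
        counts = Counter(samples[name] for name in members)
        rows.append([otu_name] + [str(counts[s]) for s in sample_names])
    return "\n".join("\t".join(row) for row in rows)


# Made-up inputs chosen to reproduce the example table above.
groups = {
    "Rat_Gut_ID3": ["Rat_Gut_ID3", "Rat_Gut_ID4", "Rat_Gut_ID5"],
    "Rat_Gut_ID25": ["Rat_Gut_ID25", "Rat_Nose_ID7"],
}
samples = {
    "Rat_Gut_ID3": "Rat_Gut_Sample1",
    "Rat_Gut_ID4": "Rat_Gut_Sample1",
    "Rat_Gut_ID5": "Rat_Gut_Sample1",
    "Rat_Gut_ID25": "Rat_Gut_Sample1",
    "Rat_Nose_ID7": "Rat_Nose_Sample1",
}
table = build_otu_table(groups, samples)
print(table)
```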
Mapping file¶
Common Extensions: .mapping, .txt
Mapping files are artifacts of renaming (via the Sequence Renaming command), and map old sequence names to new sequence names. This allows users to use more convenient sequence names during analysis, while still having access to the original sequence names for record keeping.
A Mapping file consists of one or more lines in the following format:
<old_sequence_name> <tab> <new_sequence_name>
Example
M03292:26:000000000-AH6AG:1:1101:16896:1196 BALI4462_0_ID1
M03292:26:000000000-AH6AG:1:1101:12506:1361 BALI4462_0_ID2
M03292:26:000000000-AH6AG:1:1101:15278:1402 BALI4462_0_ID3
M03292:26:000000000-AH6AG:1:1101:16930:1429 BALI4462_0_ID4
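Recovering an original sequence name from a renamed one is a matter of inverting the mapping. The sketch below assumes one tab per line and that new names are unique; `parse_mapping` is a hypothetical helper name.

```python
def parse_mapping(lines):
    """Parse mapping lines into {old_sequence_name: new_sequence_name}."""
    old_to_new = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        old_name, new_name = line.split("\t")
        old_to_new[old_name] = new_name
    return old_to_new


example = [
    "M03292:26:000000000-AH6AG:1:1101:16896:1196\tBALI4462_0_ID1\n",
    "M03292:26:000000000-AH6AG:1:1101:12506:1361\tBALI4462_0_ID2\n",
]
old_to_new = parse_mapping(example)

# Invert the mapping to look up the original name of a renamed sequence.
new_to_old = {new: old for old, new in old_to_new.items()}
print(new_to_old["BALI4462_0_ID2"])
```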