File Types

Chewbacca uses several filetypes throughout its functions. Getting acquainted with these filetypes will save you time and a lot of headaches.

Fasta File

Common Extensions: .fa, .fasta, .FASTA

Fasta files are commonly used in the biological sciences to store genetic sequences along with metadata about them.
Read more here: https://en.wikipedia.org/wiki/FASTA_format

FastQ File

Common Extensions: .fq, .fastq, .FASTQ

FastQ files are very similar to fasta files, but also include a quality score for each base of the genetic sequences.
Read more here: https://en.wikipedia.org/wiki/FASTQ_format

Groups File

Common Extensions: .groups

Groups files are used by Chewbacca to keep track of the identities and counts of sequences in groups/clusters/OTUs throughout the analytical process. Ultimately, this data is used to generate an OTU Table. These files are generated and updated when dereplication or clustering occur.

A Groups file consists of one or more lines in the following format:

GROUPNAME <tab> SequenceName <space> SequenceName <space> ...

Example

Rodent_gutID111 Rodent_gutID111 Rodent_gutID112 Rodent_gutID113
Rodent_noseID115        Rodent_stomachID115 Rodent_gutID117
Rodent_gutID119 Rodent_gutID119

Notes:

  1. The GROUPNAME for a group/cluster will likely be the name of a sequence within that group/cluster.

    This means that one sequence name will likely appear twice on a line (once as a GROUPNAME, and once as a SequenceName). This is not an error, but the intended format.

  2. See the “naming conventions” section for more info on Chewbacca sequence naming standards.

  3. In any given .groups file, a sequence name should be listed on exactly ONE line (that is, in one group/cluster/OTU).

  4. The sequence data for these sequences is stored separately.
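
The format and notes above can be sketched as a small parser. This is a minimal illustration in Python, not Chewbacca's own code: the function name parse_groups is hypothetical, and it assumes the tab and space delimiters described in the format line. It also enforces note 3 by rejecting a sequence name that is listed more than once.

```python
def parse_groups(lines):
    """Parse Groups-file lines into {group_name: [sequence_name, ...]}.

    Hypothetical helper for illustration; not part of Chewbacca's API.
    """
    groups = {}
    seen = {}  # sequence name -> group it was first listed in
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # GROUPNAME <tab> SequenceName <space> SequenceName ...
        group_name, _, members = line.partition("\t")
        names = members.split()
        for name in names:
            if name in seen:
                raise ValueError(
                    f"{name} is listed more than once "
                    f"(first in {seen[name]}, again in {group_name})"
                )
            seen[name] = group_name
        groups[group_name] = names
    return groups


example_lines = [
    "Rodent_gutID111\tRodent_gutID111 Rodent_gutID112 Rodent_gutID113\n",
    "Rodent_gutID119\tRodent_gutID119\n",
]
groups = parse_groups(example_lines)
group_sizes = {name: len(members) for name, members in groups.items()}
print(group_sizes)
```

Passing an open file handle instead of a list works the same way, since the function only iterates over lines.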

Samples File

Common Extensions: .samples

Samples files are used by Chewbacca to map sequence names to the names of their respective samples. This file is generally written once, early in the analytical process, at the time of sequence renaming. Its primary purposes are annotation and construction of an OTU table at the end of the analysis.

A Samples file consists of one or more lines in the following format:

SequenceName <tab> SampleName

Example

Rodent_gutID111 GUT_SAMPLE_21
Rodent_gutID112 GUT_SAMPLE_21
Rodent_gutID113 GUT_SAMPLE_22
Rodent_noseID115        NOSE_SAMPLE_1
Rodent_stomachID115     STOMACH_SAMPLE_2
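
As a rough sketch of how this two-column format could be read, the Python snippet below loads the example into a dictionary and counts sequences per sample. The function name parse_samples is an assumption made for illustration and is not taken from Chewbacca.

```python
from collections import Counter


def parse_samples(lines):
    """Return {sequence_name: sample_name} from Samples-file lines.

    Hypothetical helper for illustration; not part of Chewbacca's API.
    """
    seq_to_sample = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got: {line!r}")
        sequence_name, sample_name = fields
        seq_to_sample[sequence_name] = sample_name
    return seq_to_sample


example_lines = [
    "Rodent_gutID111\tGUT_SAMPLE_21",
    "Rodent_gutID112\tGUT_SAMPLE_21",
    "Rodent_gutID113\tGUT_SAMPLE_22",
    "Rodent_noseID115\tNOSE_SAMPLE_1",
    "Rodent_stomachID115\tSTOMACH_SAMPLE_2",
]
seq_to_sample = parse_samples(example_lines)

# Number of sequences assigned to each sample.
sequences_per_sample = Counter(seq_to_sample.values())
print(sequences_per_sample["GUT_SAMPLE_21"])
```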

Barcodes file

Common Extensions: .barcodes, .txt

Barcodes files map the nucleotide prefixes used for multiplexing to the samples they code for.

A Barcodes file consists of one or more lines in the following format:

<Sample_name> <tab> <barcode_sequence>

Example

GUT_SAMPLE_21        AGACGC
GUT_SAMPLE_22        AGTAGT
NOSE_SAMPLE_1        ACTAGG
STOMACH_SAMPLE_2        AGTACG
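
To make the idea of a barcode prefix concrete, here is a minimal Python sketch that looks up the sample for a read by exact prefix match. The function names and the example read are invented for illustration, and exact matching is a simplifying assumption; it is not a description of how Chewbacca demultiplexes reads.

```python
def parse_barcodes(lines):
    """Return {barcode_sequence: sample_name} from Barcodes-file lines.

    Hypothetical helper for illustration; not part of Chewbacca's API.
    """
    barcode_to_sample = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_name, barcode = fields
        barcode_to_sample[barcode.upper()] = sample_name
    return barcode_to_sample


def assign_sample(read, barcode_to_sample):
    """Return the sample whose barcode prefixes the read, or None."""
    read = read.upper()
    for barcode, sample_name in barcode_to_sample.items():
        if read.startswith(barcode):
            return sample_name
    return None


barcodes = parse_barcodes([
    "GUT_SAMPLE_21\tAGACGC",
    "GUT_SAMPLE_22\tAGTAGT",
    "NOSE_SAMPLE_1\tACTAGG",
    "STOMACH_SAMPLE_2\tAGTACG",
])

# A made-up read that begins with the GUT_SAMPLE_21 barcode.
matched = assign_sample("AGACGCTTGACCGATTACA", barcodes)
unmatched = assign_sample("CCCCCCTTGACCGATTACA", barcodes)
print(matched, unmatched)
```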

Adapters file

Common Extensions: .adapters, .txt, .fa, .fasta

Adapters files are fasta files that contain the forward-read pyrosequencing adapters. An Adapters file should be paired with an RC Adapters file, and should contain the same number of entries as its paired RC Adapters file.

Example

>adapter1
GGWACWGGWTGAACWGTWTAYCCYCC
>adapter2
TANACYTCNGGRTGNCCRAARAAYCA

RC Adapters file

Common Extensions: .adapters, .txt, .fa, .fasta

RC Adapters files are fasta files that contain the reverse-read pyrosequencing adapters (the reverse complements of the forward-read adapters). An RC Adapters file should be paired with an Adapters file, and should contain the same number of entries as its paired Adapters file.

Example

>adapter1_RC
TGRTTYTTYGGNCAYCCNGARGTNTA
>adapter2_RC
GGRGGRTAWACWGTTCAWCCWGTWCC
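
Because the adapters contain IUPAC ambiguity codes (W, Y, R, N, and so on), a reverse complement has to map those codes as well as A, C, G, and T. The Python sketch below shows one way to do that; the function name reverse_complement is illustrative and not part of Chewbacca. In the two examples above, the sequence labeled adapter2_RC is the reverse complement of adapter1, and adapter1_RC is the reverse complement of adapter2.

```python
# Complement table covering the standard IUPAC nucleotide codes.
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)


def reverse_complement(sequence):
    """Return the reverse complement of an IUPAC nucleotide sequence."""
    return sequence.upper().translate(IUPAC_COMPLEMENT)[::-1]


adapter1 = "GGWACWGGWTGAACWGTWTAYCCYCC"
adapter2 = "TANACYTCNGGRTGNCCRAARAAYCA"
adapter1_rc = "TGRTTYTTYGGNCAYCCNGARGTNTA"
adapter2_rc = "GGRGGRTAWACWGTTCAWCCWGTWCC"

print(reverse_complement(adapter1) == adapter2_rc)  # True
print(reverse_complement(adapter2) == adapter1_rc)  # True
```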

Tax file

Common Extensions: .tax, .out, .txt

Tax files are condensed versions of blast6 output files, detailing the match between a query sequence and a possible identification. These files are generated by the id_OTU command and ingested by the annotate_OTU command.

Given the blast6 output format, a Tax file consists of one or more lines in the following format:

<query> <tab> <target> <tab> <id> <tab> <alnlen> <tab> <qcov>

Example

BALI4606_0_ID1264_2     GBMAA1117-14    90.6    265     84.7    Animalia;Porifera;Demospongiae;Haplosclerida;Phloeodictyidae;;Calyx;Calyx podatypa
BALI4462_0_ID921_1      GBCI5234-15     98.8    258     82.4    Animalia;Cnidaria;Anthozoa;Alcyonacea;Xeniidae;;Xenia;Xenia sp. 1 CSM2014
BALI4673_0_ID837_1      KHA237-14       96.1    279     100.0   Animalia;Cnidaria;Anthozoa;Actiniaria;;;;
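
The snippet below is a minimal Python sketch of reading one Tax line into typed fields. The names TaxRecord and parse_tax_line are hypothetical. The example lines above carry a sixth, semicolon-delimited taxonomy column that the format line does not list, so the sketch treats that column as optional; this is an assumption based only on the example.

```python
from typing import List, NamedTuple


class TaxRecord(NamedTuple):
    query: str
    target: str
    identity: float   # <id>, percent identity
    aln_len: int      # <alnlen>, alignment length
    query_cov: float  # <qcov>, query coverage
    taxonomy: List[str]  # optional sixth column, split on ';'


def parse_tax_line(line):
    """Parse one tab-delimited Tax line (illustrative helper only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    query, target, identity, aln_len, query_cov = fields[:5]
    # Assumption: a sixth field, when present, holds the taxonomy string.
    taxonomy = fields[5].split(";") if len(fields) > 5 else []
    return TaxRecord(query, target, float(identity), int(aln_len),
                     float(query_cov), taxonomy)


record = parse_tax_line(
    "BALI4673_0_ID837_1\tKHA237-14\t96.1\t279\t100.0\t"
    "Animalia;Cnidaria;Anthozoa;Actiniaria;;;;"
)
# Empty strings mark ranks left blank in the taxonomy column.
known_ranks = [rank for rank in record.taxonomy if rank]
print(record.target, record.identity, known_ranks)
```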

OTU Table

Common Extensions: .txt

OTU tables are commonly used in biological surveys to list the abundance of each OTU in each sample.

OTU tables consist of a header line in the following format:

OTU <tab> <Samplename1> <tab> <Samplename2> <tab> <Samplename3> ...

followed by one or more lines (one per OTU) in the following format:

<OTU_name> <tab> <Abundance at Samplename1> <tab> <Abundance at Samplename2> <tab> <Abundance at Samplename3>

Example

OTU     Rat_Gut_Sample1 Rat_Nose_Sample1
Rat_Gut_ID3     3       0
Rat_Gut_ID25    1       1
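
As an illustration of the header-plus-rows layout, the following Python sketch reads an OTU table into a nested dictionary and totals the abundances per sample. The function name parse_otu_table is an assumption for this example and does not come from Chewbacca.

```python
def parse_otu_table(lines):
    """Return {otu_name: {sample_name: abundance}} from OTU-table lines.

    Hypothetical helper for illustration; not part of Chewbacca's API.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    sample_names = header[1:]  # first header cell is the literal "OTU"
    table = {}
    for row in body:
        otu_name, counts = row[0], row[1:]
        if len(counts) != len(sample_names):
            raise ValueError(f"row for {otu_name} has {len(counts)} counts, "
                             f"expected {len(sample_names)}")
        table[otu_name] = dict(zip(sample_names, map(int, counts)))
    return table


otu_table = parse_otu_table([
    "OTU\tRat_Gut_Sample1\tRat_Nose_Sample1",
    "Rat_Gut_ID3\t3\t0",
    "Rat_Gut_ID25\t1\t1",
])

# Total abundance observed in each sample, summed over all OTUs.
totals = {}
for abundances in otu_table.values():
    for sample_name, count in abundances.items():
        totals[sample_name] = totals.get(sample_name, 0) + count
print(totals)
```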

Mapping file

Common Extensions: .mapping, .txt

Mapping files are artifacts of renaming (via the Sequence Renaming command), and map old sequence names to new sequence names. This allows users to use more convenient sequence names during analysis, while still having access to the original sequence names for record keeping.

A Mapping file consists of one or more lines in the following format:

<old_sequence_name> <tab> <new_sequence_name>

Example

M03292:26:000000000-AH6AG:1:1101:16896:1196     BALI4462_0_ID1
M03292:26:000000000-AH6AG:1:1101:12506:1361     BALI4462_0_ID2
M03292:26:000000000-AH6AG:1:1101:15278:1402     BALI4462_0_ID3
M03292:26:000000000-AH6AG:1:1101:16930:1429     BALI4462_0_ID4