Available Commands¶

Below is a list of the available Chewbacca commands.

Error Correction¶

class preclean.Preclean_Command.Preclean_Command(args_)[source]¶

Attempts to fix minor sequencing errors caused by pyrosequencing. By reducing errors prior to sequence assembly, a greater number of paired reads can be successfully assembled. Matching forward and reverse files should be identically named, except for a <forward>/<reverse> suffix that indicates the read orientation. The two suffix conventions below are supported. Choose ONE suffix style and stick to it! Mixed suffixes are not supported.

_forwards/_reverse
and
_R1/_R2

Inputs:

fastq file(s) with left reads.
fastq file(s) with right reads.

Outputs:

<left reads file>_corrected.fastq file(s).
<right reads file>_corrected.fastq file(s).

Example:

Assuming a forward reads file ‘Data_R1.fq’ and a reverse reads file ‘Data_R2.fq’,

./
    Data_R1.fq
    Data_R2.fq

$ python chewbacca.py preclean -f Data_R1.fq -r Data_R2.fq -o rslt

rslt/
    Data_R1_corrected.fq
    Data_R2_corrected.fq

default_program¶: alias of Preclean_Program_Bayeshammer

Assembling Sequences¶

class assemble.Assemble_Command.Assemble_Command(args_)[source]¶

Assembles reads from two (forward and reverse) fastq files/directories. For a set of k forward read files, and k reverse read files, return k assembled files. Matching forward and reverse files should be identically named, except for a <forward>/<reverse> suffix that indicates the read orientation. The two suffix conventions below are supported. Choose ONE suffix style and stick to it! Mixed suffixes are not supported.

_forwards/_reverse
and
_R1/_R2

Inputs:

fastq file(s) with left reads
fastq file(s) with right reads

Outputs:

fastq File(s) with assembled reads

Notes:

Choose ONE suffix style and stick to it! Mixed suffixes are not supported. For example, Sample_100_forwards.fq and Sample_100_reverse.fq will be assembled into Sample_100_assembled.fq. Similarly, Sample_100_R1.fq and Sample_100_R2.fq will be assembled into Sample_100_assembled.fq. However, Sample_100_forwards.fq and Sample_100_R2.fq are not guaranteed to be matched.
You can provide as many pairs of files as you wish as long as they follow exactly one of the above naming conventions. If a ‘name’ parameter is provided, it will be used as a filename (not path) prefix for all assembled sequence files.
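The pairing rule in the notes above can be sketched as follows. This is only an illustration of the naming convention, not Chewbacca's own matching code: the helper name `pair_reads` and its error handling are assumptions made for this example.

```python
import os

# Suffix conventions named in the documentation: (forward, reverse).
SUFFIX_STYLES = [("_forwards", "_reverse"), ("_R1", "_R2")]


def pair_reads(filenames):
    """Pair forward/reverse read files that share a base name.

    Hypothetical helper: returns {base_name: (forward_file, reverse_file)}
    and raises ValueError when the files do not all follow ONE suffix style.
    """
    for fwd_suffix, rev_suffix in SUFFIX_STYLES:
        forwards, reverses = {}, {}
        for name in filenames:
            stem, _ext = os.path.splitext(os.path.basename(name))
            if stem.endswith(fwd_suffix):
                forwards[stem[: -len(fwd_suffix)]] = name
            elif stem.endswith(rev_suffix):
                reverses[stem[: -len(rev_suffix)]] = name
        if len(forwards) + len(reverses) == len(filenames):
            # Every file follows this style; pair files by shared base name.
            return {base: (forwards[base], reverses[base])
                    for base in forwards if base in reverses}
    raise ValueError("files mix suffix styles or use an unsupported suffix")
```

Calling `pair_reads(["Sample_100_R1.fq", "Sample_100_R2.fq"])` pairs the two files under the base name `Sample_100`, while a mixed pair such as `Sample_100_forwards.fq` and `Sample_100_R2.fq` is rejected.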

Example

Assuming a forward reads file ‘Data_R1.fq’ and a reverse reads file ‘Data_R2.fq’,

./
    Data_R1.fq
    Data_R2.fq

$ python chewbacca.py assemble -n BALI -f Data_R1.fq -r Data_R2.fq -o rslt

rslt/
    BALI_DATA.assembled.fq

default_program¶: alias of Assemble_Program_Pear

Demultiplexing by Barcode¶

Sequence Renaming¶

class rename.Rename_Command.Rename_Command(args_)[source]¶

Renames sequences in a file with their filename and a serial ID#. Useful for simplifying complex naming systems into human-readable sequence names. In order to ensure the correct sample names are preserved, it is recommended that this command be run immediately after the Demux Command.

Inputs:

One or more fasta/fastq files to rename.

Outputs:

_renamed.<ext> file - A <fasta/fastq> file with the renamed sequences.
.samples file - A Samples File.
.mapping file - A Mapping file.

Notes:

In order for the .samples file to correctly list the sample name of the sequences in a file, this command should be run immediately after the Demux Command.
The --clip parameter tells Chewbacca that trailing _<file_ID#> (from the demultiplexing command) should not be considered part of the sample name. By default this is set to True, and should be fine. If you notice parts of your sample names getting clipped off in your .samples file, you should explicitly set this parameter to False.
Each input file will have a corresponding .samples, .mapping, and _renamed file.
The .samples file is needed by downstream Chewbacca processes (Building the OTU Table).
The .mapping file is purely for user convenience and record-keeping.

Example:

SampleA_0.fasta:
    @M03292:26:000000000-AH6AG:1:1101:22127:1256
    AAAA
    @M03292:26:000000000-AH6AG:1:1101:22127:1257
    AAAT

$ python chewbacca.py rename -i SampleA_0.fasta -o rslt

rslt/SampleA_0_renamed.fasta:
    @SampleA_ID0
    AAAA
    @SampleA_ID1
    AAAT

rslt_samples/SampleA_0_renamed.samples:
    SampleA_ID0 SampleA
    SampleA_ID1 SampleA

rslt_aux/SampleA_0_renamed.mapping:
    M03292:26:000000000-AH6AG:1:1101:22127:1256 SampleA_ID0
    M03292:26:000000000-AH6AG:1:1101:22127:1257 SampleA_ID1
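The relationship between the three outputs in the example above can be sketched in a few lines. This is a hedged illustration rather than Chewbacca's implementation: `rename_records` is a hypothetical helper, and the assumption that clipping strips a trailing `_<digits>` from the file stem is inferred from the example only.

```python
import re


def rename_records(file_stem, old_names, clip=True):
    """Derive new sequence names plus .samples and .mapping rows.

    file_stem: input filename without extension, e.g. "SampleA_0".
    clip: when True, a trailing _<file_ID#> is not part of the sample name
          (assumed here to be an underscore followed by digits).
    """
    sample = re.sub(r"_\d+$", "", file_stem) if clip else file_stem
    # Serial IDs start at 0, matching the example output.
    new_names = [f"{sample}_ID{i}" for i in range(len(old_names))]
    samples_rows = [(new, sample) for new in new_names]   # .samples file
    mapping_rows = list(zip(old_names, new_names))        # .mapping file
    return new_names, samples_rows, mapping_rows
```

With `file_stem="SampleA_0"` and the two read names from the example, the sketch yields `SampleA_ID0` and `SampleA_ID1`, both assigned to sample `SampleA`.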

Adapter Removal¶

class clean.Clean_Adapters_Command.Clean_Adapters_Command(args_)[source]¶

Removes sequencing adapters (and preceding barcodes) from sequences in input file(s). Sequences should be in the following format:

<BARCODE><ADAPTER><SEQUENCE><RC_ADAPTER>.

Valid ADAPTER sequences, and their reverse-complements (RC_ADAPTER) should be defined separately in a pair of fasta-formatted files. Sequences passed to this command should have already been demultiplexed, as this process will remove the identifying barcode sequences.

Inputs:

One or more fasta/fastq files to clean.
A single Adapters file
A single RC Adapters file

Outputs:

<filename>_debarcoded.<ext> file(s) - <fasta/fastq> files, containing sequences with their leading adapters, trailing adapters, and barcodes removed.

Notes:

Be aware of the program-specific details around ‘N’ nucleotide characters.

Example:

Given Data_ID#1 with barcode=AGACGC:

./
    Data.fasta:
        @Data_ID#1
        AGACGCGGWACWGGWTGAACWGTWTAYCCYCCATCGATCGATCGTGRTTYTTYGGNCAYCCNGARGTNTA

    Data.adapters:
        >1
        GGWACWGGWTGAACWGTWTAYCCYCC

    Data.adapters_RC:
        >first
        TGRTTYTTYGGNCAYCCNGARGTNTA

$ python chewbacca.py trim_adapters -i Data.fasta -o rslt -a Data.adapters -arc Data.adapters_RC

rslt/
    Data_debarcoded.fastq:
        @Data_ID#1
        ATCGATCGATCG
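To make the <BARCODE><ADAPTER><SEQUENCE><RC_ADAPTER> layout concrete, here is a minimal regex-based sketch. It is not how Flexbar works internally; it only illustrates the layout under simplifying assumptions (exact matching, adapters written with IUPAC ambiguity codes, RC adapter at the very end of the read). The names `adapter_to_regex` and `strip_adapters` are hypothetical.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def adapter_to_regex(adapter):
    """Turn an adapter with ambiguity codes into a regex fragment.

    Each position also accepts the code letter itself, so a read that
    literally contains 'W' (as in the example above) still matches.
    """
    return "".join("[" + IUPAC[base] + base + "]" for base in adapter.upper())


def strip_adapters(read, adapter, rc_adapter):
    """Return the insert between ADAPTER and RC_ADAPTER, or None if absent.

    Anything before the adapter (the barcode) is discarded with it.
    """
    pattern = re.compile(
        "^.*?" + adapter_to_regex(adapter)
        + "(.*?)" + adapter_to_regex(rc_adapter) + "$"
    )
    match = pattern.match(read)
    return match.group(1) if match else None
```

Applied to the read in the example, with the adapters from the two adapter files, the sketch returns `ATCGATCGATCG`.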

default_program¶: alias of Clean_Adapters_Program_Flexbar

Quality Cleaning¶

class clean.Clean_Quality_Command.Clean_Quality_Command(args_)[source]¶

Removes regions of low quality from fastq-formatted reads. These regions are likely sources of error, and would be detrimental to other analytical processes. Input sequences to this command should have already been demultiplexed, and had their barcodes/adapters removed. Otherwise, the partial removal of these markers would leave behind invalid fragments that would be difficult to detect and would likely cause errors downstream.

Inputs:

One or more fastq files to clean.

Outputs:

<filename>_cleaned.fastq file(s) - Fastq files, containing sequences with areas of low quality removed.

Notes:

Be aware of the program-specific details around ‘N’ nucleotide characters.
Be aware of the program-specific defaults for minimum surviving sequence lengths.

Example:

./
    Data.fasta:
        @Data_ID#1
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCTTTACAG
        +
        !zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz%%%zzzz

The command below asks Chewbacca to trim away any section of length 3 NT in Data_ID#1 that has quality worse than 20, keeping the longer of the remaining ends. If the remaining sequence at the end of this process is shorter than 15 NT, discard the whole sequence (these values are chosen for illustrative purposes).

$ python chewbacca.py clean_seqs -i Data.fasta -o rslt -m 15 -w 3 -q 20

Note that the ‘TTT’ subsequence has been cut, because its average quality (5) is less than the threshold (20). After this cut, the longest surviving subsequence (the subsequence to the left of the cut) was kept, and the shorter subsequence (to the right of the cut) was discarded. Because the final sequence is longer than 15NT, it is kept and written to the output file.

rslt/
    Data_cleaned.fastq:
        @Data_ID#1
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTC
        +
        !zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
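The window-based trimming described above can be sketched as follows. This is a simplified illustration, not Trimmomatic's actual algorithm: it assumes Phred+33 quality encoding, marks every window whose mean quality falls below the threshold, and keeps the longest surviving stretch. The function name `trim_low_quality` and its parameters are hypothetical.

```python
def trim_low_quality(seq, qual, window=3, min_q=20, min_len=15, offset=33):
    """Cut low-quality windows and keep the longest surviving segment.

    Returns (sequence, quality) for the kept segment, or None when the
    longest segment is shorter than min_len. Assumes Phred+33 encoding.
    """
    scores = [ord(char) - offset for char in qual]
    bad = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if sum(scores[start:start + window]) / window < min_q:
            for i in range(start, start + window):
                bad[i] = True

    # Scan for the longest run of bases not marked as bad.
    best_start, best_len, run_start = 0, 0, None
    for i, flagged in enumerate(bad + [True]):  # sentinel closes the last run
        if not flagged and run_start is None:
            run_start = i
        elif flagged and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None

    if best_len < min_len:
        return None
    end = best_start + best_len
    return seq[best_start:end], qual[best_start:end]
```

Run on the read from the example with `window=3`, `min_q=20` and `min_len=15`, only the window covering the three `%` quality characters is cut, and the 53 bases to its left are kept.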

default_program¶: alias of Clean_Quality_Program_Trimmomatic

File Conversion¶

class util.Convert_Fastq_Fasta_Command.Convert_Fastq_Fasta_Command(args_)[source]¶

Converts a FastQ-formatted file to a Fasta-formatted file. Useful for reducing data size and preparing for fasta-only operations.

Inputs:

One or more fastq files to convert to fasta.

Outputs:

<filename>.fasta file(s) - Converted fasta files.

Example:

./
    Data.fq:
        @Data_ID1
        GATTTGGGG
        +
        !zzzzzzzz
        @Data_ID2
        GATTTGGGG
        +
        !zzzzzzzz

$ python chewbacca.py convert_fastq_to_fasta -i Data.fq -o rslt/

rslt/
    Data.fasta:
        >Data_ID1
        GATTTGGGG
        >Data_ID2
        GATTTGGGG

Dereplication¶

class dereplicate.Dereplicate_Command.Dereplicate_Command(args_)[source]¶

Dereplicates a fasta file by grouping identical reads together under one representative sequence. The number of duplicate/replicant sequences each representative sequence represents is given by a ‘replication count’ at the end of the sequence name in the output fasta file. If a .groups file is provided, then previous replication counts will be taken into account (e.g. imagine a representative sequence X that represents 3 sequences. If X is found to be a replicant of another sequence Y, X will add 3 to the replication count of Y). Replication counts are denoted with a suffix of ‘_K’ on the sequence name, where K is the replication count for the group that sequence represents.

Inputs:

One or more fasta files to dereplicate.
Optional: Groups File - A list of representative names and the names of their replicant sequences. You likely have one of these files if you’ve previously run a clustering or dereplication command.

Outputs:

_counts.fasta file - A fasta file with unique sequences and their replication counts.
_derep.groups file - A list of representative names and the names of their replicant sequences.

Notes:

This command only dereplicates within each fasta file (not across all files). This means a sequence in one file will be unique within that file, but might exist in another file. To ensure sequences are unique across an entire dataset, merge all fasta files into one file, then dereplicate that fasta file.
Each input file will generate a corresponding _counts.fasta file.
If an input .groups file is not provided, then each input fasta file will generate a new groups file named <file_name>_derep.groups. If an input .groups file IS provided, then a single groups file named ‘dereplicated_updated.groups’ will be generated.
The output .groups file is needed by downstream Chewbacca processes (Dereplication, Clustering, Building the OTU Table).
The order of sequence names in the *_counts.fasta and .groups file is arbitrary.

Example:

./
    Data.fasta
        >seq1
        AAA
        >seq2
        AAA
        >seq3
        AAAG
        >seq4
        AAAGT
        >seq7
        AAAGT

    test.groups
        seq4    seq4 seq5 seq6

In the above example, test.groups indicates that seq4 is a sequence that has previously been identified as a representative (in some earlier round of clustering or dereplication). Note that seq4 is a representative for a ‘group’ of identical sequences and therefore listed within that group.

$ python chewbacca.py dereplicate_fasta -i Data.fasta -o rslt -g test.groups

rslt/Data_counts.fasta:
    >seq4_4
    AAAGT
    >seq1_2
    AAA
    >seq3_1
    AAAG

rslt_groups_files/*.groups:
    seq3        seq3
    seq1        seq2 seq1
    seq4        seq7 seq6 seq5 seq4

Notice that Data_counts.fasta lists the unique sequences from Data.fasta, and their replication counts. Also notice that seq4 had previous replication data (stored in the test.groups file).
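The bookkeeping in this example can be reproduced with a short sketch. It is not the Vsearch-backed implementation: `dereplicate` is a hypothetical helper, and treating the first occurrence of each sequence as its representative is an assumption made for illustration.

```python
def dereplicate(records, groups=None):
    """Group identical sequences and carry forward earlier group members.

    records: list of (name, sequence) pairs in file order.
    groups:  optional {representative: [member names]} from a .groups file.
    Returns (counts, members): counts maps "<rep>_<K>" to the sequence,
    members maps each representative to every name it stands for.
    """
    groups = groups or {}
    rep_by_seq = {}  # sequence -> name of its representative
    members = {}     # representative -> names it represents
    for name, seq in records:
        # A name with earlier replication data brings its whole group along.
        previous = groups.get(name, [name])
        if seq in rep_by_seq:
            members[rep_by_seq[seq]].extend(previous)
        else:
            rep_by_seq[seq] = name
            members[name] = list(previous)
    counts = {f"{rep}_{len(members[rep])}": seq
              for seq, rep in rep_by_seq.items()}
    return counts, members
```

Feeding in the five records of Data.fasta together with the test.groups entry for seq4 gives the names `seq1_2`, `seq3_1` and `seq4_4`, matching Data_counts.fasta above (the order of names is arbitrary).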

default_program¶: alias of Dereplicate_Program_Vsearch

File Splitting¶

class util.Partition_Command.Partition_Command(args_)[source]¶

A utility command that partitions a fasta/fastq file into a set of files (of the same file format), with a user-specified (maximum) number of sequences per file. Allows users to partition a large file into segments, and perform discrete operations in run_parallel over those segments.

Inputs:

One or more fasta/fastq files to partition.
C: An integer defining the maximum number of sequences per file

Outputs:

<filename>_part_<part_#>.<ext> file(s) - <fasta/fastq> files, with at most C sequences per file.

Example:

./
    Data.fq:
        @Data_ID1
        GATTTGGGG
        +
        !zzzzzzzzz
        @Data_ID2
        GATTTGGGG
        +
        !zzzzzzzzz
        @Data_ID3
        GATTTGGGG
        +
        !zzzzzzzzz

$ python chewbacca.py convert_fastq_to_fasta -i Data.fq -o rslt/

rslt/
    Data.fasta:
        @Data_ID1
        GATTTGGGG
        @Data_ID2
        GATTTGGGG
        @Data_ID3
        GATTTGGGG
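For the partitioning itself, the chunking and the <filename>_part_<part_#>.<ext> naming scheme listed under Outputs can be sketched as below. This is an illustration only: `partition` is a hypothetical helper, and numbering parts from 0 is an assumption, since the documentation does not state where part numbers start.

```python
def partition(records, max_per_file, filename="Data", ext="fq"):
    """Split records into chunks of at most max_per_file sequences.

    Returns {output_file_name: [records]}, with names following the
    <filename>_part_<part_#>.<ext> pattern (part numbers assumed 0-based).
    """
    if max_per_file < 1:
        raise ValueError("max_per_file must be a positive integer")
    parts = {}
    for start in range(0, len(records), max_per_file):
        part_no = start // max_per_file
        parts[f"{filename}_part_{part_no}.{ext}"] = records[start:start + max_per_file]
    return parts
```

Splitting three records with a maximum of two per file produces two parts, the second holding the single leftover record.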

File Merging¶

class util.Merge_Command.Merge_Command(args_)[source]¶

Concatenates multiple files into a single file. Useful for combining the results of a run_parallel operation, or when preparing for cross-sample dereplication.

Inputs:

A set of files to merge.
An <output_filename>.
An <output_prefix>.

Outputs:

<output_filename>.<output_prefix> - A file consisting of all the input files concatenated together.

Notes:

The order of the content in the concatenated files is not guaranteed.

Example:

targets/
    Data.fq:
        @Data_ID1
        GATTTGGGG
        +
        !zzzzzzzzz

    Data2.fa:
        @Data_ID1
        GATTTGGGG

    Blah.txt
        Hello World!

$ python chewbacca.py merge_files -i targets/ -o rslt/ -f txt -n Merged

rslt/
    Merged.txt:
        Hello World!
        @Data_ID1
        GATTTGGGG
        +
        !zzzzzzzzz
        @Data_ID1
        GATTTGGGG

File Cleaning¶

class util.Ungap_Command.Ungap_Command(args_)[source]¶

Removes target characters from a fasta/fastq file. Useful for removing gap characters from sequence alignments.

Inputs:

One or more fasta/fastq files to clean.
A string of gap characters to remove.

Outputs:

*_cleaned.<ext> file - A <fasta/fastq> file with gap characters removed from its sequences.

Example:

Data.fasta:
    >seq1
    AAAAA.A*A-A

$ python chewbacca.py ungap_fasta -i Data.fasta -o rslt -f fasta -g ".*-"

rslt/Data.fasta:
    >seq1
    AAAAAAAA
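The character removal in this example amounts to deleting a set of characters from every sequence line while leaving header lines alone. A minimal sketch, assuming fasta input where header lines start with ‘>’ (the helper names are hypothetical):

```python
def ungap(sequence, gap_chars):
    """Delete every character listed in gap_chars from a sequence string."""
    return sequence.translate(str.maketrans("", "", gap_chars))


def ungap_fasta_lines(lines, gap_chars):
    """Ungap sequence lines only; header lines (starting with '>') are kept."""
    return [line if line.startswith(">") else ungap(line, gap_chars)
            for line in lines]
```

With gap characters `".*-"`, the sequence `AAAAA.A*A-A` becomes `AAAAAAAA`, as in the example.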

Deep Cleaning¶

class clean.Clean_Deep_Command.Clean_Deep_Command(args_)[source]¶

Performs an intensive deep-cleaning of sequences to eliminate frameshifts, detect chimeras, and determine sequence orientation. Input files to this command should first be dereplicated. Doing so will reduce the total number of alignments required, and reduce computation time.

Inputs:

One or more fasta/fastq files to deep clean (nucleotide sequences).
One reference fasta (nucleotide sequences).

Outputs:

*_AA - Amino Acid Alignment file, including reference sequences.
*_log.csv - A log listing each input sequence, and deep cleaning results for each sequence.
*_NT - Nucleotide Alignment file, including reference sequences.

Notes:

Sequences that do not meet quality cleaning standards are dropped.
The output files contain reference sequences, and odd alignment characters. Both of these need to be removed by running the Clean_Deep_Repair Command.

Example:

Data.fasta
BIOCODE.fa

$ python chewbacca.py macseAlign -i Data.fasta -o rslt -d BIOCODE.fa

rslt/Data_AA
rslt/Data_NT
rslt/Data_log.csv

default_program¶: alias of Clean_Deep_Program_Macse

Deep Cleaning Repair¶

class clean.Clean_Deep_Repair_Command.Clean_Deep_Repair_Command(args_)[source]¶

Cleans aligned files by removing gap characters and reference sequences from the file. Sequences passed to this command should have previously been aligned.

Inputs:

*_AA - Amino Acid Alignment file, including reference sequences.
*_log.csv - A log listing each input sequence, and deep cleaning results for each sequence.
*_NT - Nucleotide Alignment file, including reference sequences.
The original fasta files that were passed to the Clean_Deep Command.
The nucleotide reference fasta that was passed to the Clean_Deep Command.

Outputs:

*_MERGED.fasta - A clean fasta file with all the surviving sequences from deep cleaning.

Notes:

A single *_MERGED.fasta is generated regardless of the number of input files.

Example:

BIOCODE.fa

originalData/Data.fasta

input/
    Data_AA
    Data_NT
    Data_log.csv

$ python chewbacca.py -i input/ -o out/ -d BIOCODE.fa -s originalData/

out/
    MACSE_OUT_MERGED.fasta

default_program¶: alias of Clean_Deep_Repair_Program_Macse

Sequence Clustering¶

class cluster.Cluster_Command.Cluster_Command(args_)[source]¶

Clusters a set of fasta files. This command generates a fasta file of unique sequences (each representing a cluster) and a .groups file. This command also takes an optional .groups file containing replication data from previous commands. If a .groups file is supplied, only one output .groups file is generated (regardless of the number of inputs).

Inputs:

One or more fasta files to cluster.
Optional: Groups File - A list of representative names and the names of their replicant sequences. You likely have one of these files if you’ve previously run a clustering or dereplication command.

Outputs:

*.fasta file - A fasta file with unique sequences and their replication counts.
*.groups - A Groups File

Notes:

The input fasta file(s) should have been dereplicated before clustering.
For a single experiment with multiple fasta files, it is best to merge all input fasta files, dereplicate them, then cluster the single merged and dereplicated fasta file. This provides the best OTU groupings.

Example:

./
    Data.fasta:
        >seq1_3
        AAAAAAAAAA
        >seq2_1
        ATAAAAAAAA
        >seq3_1
        TTTTTTTTTT
        >seq4_1
        TTTTTTATTT
        >seq5_1
        TTTTTTATCT


    Data.groups:
        seq1    seq6 seq1 seq7

$ python chewbacca.py cluster_seqs -i Data.fasta -o rslt -g Data.groups

rslt/
    Data_clustered_seeds.fasta:
        >seq1_4
        AAAAAAAAAA
        >seq3_3
        TTTTTTTTTT

rslt_groups_files/
    postcluster_updated.groups:
        seq3    seq3 seq5 seq4
        seq1    seq2 seq1 seq7 seq6
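How an existing .groups file is folded into the clustering result can be sketched as follows. The clustering step itself (deciding which sequences are similar enough to share a seed) is taken as given here. `update_groups` and `seed_names` are hypothetical helpers, and names are assumed to be written without their ‘_K’ count suffix.

```python
def update_groups(clusters, previous_groups=None):
    """Expand each cluster member into the names it already represents.

    clusters:        {seed name: [names clustered under that seed]}
    previous_groups: {representative: [member names]} from earlier steps
    """
    previous_groups = previous_groups or {}
    updated = {}
    for seed, hits in clusters.items():
        names = []
        for hit in hits:
            # A hit with earlier replication data contributes its whole group.
            for name in previous_groups.get(hit, [hit]):
                if name not in names:
                    names.append(name)
        updated[seed] = names
    return updated


def seed_names(updated_groups):
    """Seed names with the '_K' replication-count suffix appended."""
    return [f"{seed}_{len(names)}" for seed, names in updated_groups.items()]
```

For the example above, clustering seq2 under seq1 and seq4 and seq5 under seq3, combined with the earlier Data.groups entry for seq1, gives groups of size 4 and 3, hence the seed names `seq1_4` and `seq3_3`.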

OTU Table Construction¶

class otu.Build_OTU_Table_Command.Build_OTU_Table_Command(args_)[source]¶

Builds an OTU table using a .groups, .samples, and .barcodes file. The OTU table shows OTU (group) abundance by sample.

Inputs:

One or more Samples Files.
One or more Barcodes files.
One or more Groups Files.

Outputs:

matrix.txt - A tab-delimited table mapping OTUs (groups) to their abundance in each sample.

Notes:

A sequence name may not appear in more than one groups file (or on more than one line in a groups file, for that matter!).

Example:

./
    test.barcodes
        Sample1 aaaaaa
        Sample2 aaaaat
        Sample3 aaaaac
        Sample4 aaaaag

    test.groups
        seq3    seq3 seq5 seq4
        seq1    seq2 seq1 seq7 seq6

    test.samples
        seq1    Sample1
        seq2    Sample1
        seq3    Sample1
        seq4    Sample2
        seq5    Sample2
        seq6    Sample2
        seq7    Sample3

$ python chewbacca.py build_matrix -b test.barcodes -g test.groups -s test.samples -o rslt/

rslt/
    matrix.txt
        OTU     Sample1 Sample2 Sample3 Sample4
        seq3    1       2       0       0
        seq1    2       1       1       0
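The counts in matrix.txt follow directly from the .groups and .samples files: each member of a group adds one to the column of the sample it came from. A minimal sketch of that tally, assuming the files have already been parsed into dictionaries (`build_otu_table` is a hypothetical helper; the sample names would come from the first column of the .barcodes file):

```python
def build_otu_table(groups, samples, sample_names):
    """Count, per OTU, how many member sequences come from each sample.

    groups:       {OTU representative: [member sequence names]}
    samples:      {sequence name: sample name}
    sample_names: column order for the table
    Returns {OTU: [count for each sample in sample_names]}.
    """
    table = {}
    for otu, members in groups.items():
        row = dict.fromkeys(sample_names, 0)
        for member in members:
            row[samples[member]] += 1
        table[otu] = [row[name] for name in sample_names]
    return table
```

Using the test.groups and test.samples contents from the example, the sketch returns `[1, 2, 0, 0]` for seq3 and `[2, 1, 1, 0]` for seq1.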

OTU Identification¶

class otu.Query_OTU_DB_Command.Query_OTU_DB_Command(args_)[source]¶

Aligns sequences in a fasta file against those in a reference database in order to determine OTU identity.

Inputs:

One or more fasta files containing sequences to identify.
A curated fasta file of high quality sequences and known species.
A database containing taxonomic identifiers for sequences in the curated fasta file.

Outputs:

A Tax file.

Notes:

The files COI.fasta and ncbi.db are included in the Chewbacca Docker distributions.

Example:

~/ARMS/refs/

    COI.fasta # A precompiled fasta file of COI data from NCBI.
        >94483305
        AGGACGGATCAGACGAAGAGGGGCGTTTGGTATTGGGTTATGGCAGGGGGTTTTATATTGATAATTGTTGTGATGAAATT
        GATGGCCCCTAAGATAGAGGAGACACCTGCTAGGTGTAAGGAGAAGATGGTTAGGTCTACGGAGGCTCCAGGGTGGGAGT

    ncbi.db # A precompiled database of (Taxa) for the entries in 'COI.fasta'.


data/
    Data.fasta:
        >seq1
        GAATAGGTGTTGGTATAGAATGGGGTCTCCTCCTCCGGCGGGGTCGAAGAAGGTGGTGTTGAGGTTGCGGTCTGTTAGTAGTATAGTGATGCCAGCAG
        CTAGGACTGGGAGAGATAGGAGAAGTAGGACTGCTGTGATTAGGACGGATCAGACGAAGAGGGGCGTTTGGTATTGGGTTATGGCAGGGGGTTTTATA
        TTGATAATTGTTGTGAGGAAATTGATGGCCCCTAAGATAGAGGAGACACCTGCTAGGTGTAAGGAGAAGATGGTTAGGTCTACGGAGGCTCCAGGGTG
        GGAGTAGTTCCCTGCTAA

$ python chewbacca.py query_db -i Data.fasta -o out/ -r ~/ARMS/refs/COI.fasta -d ~/ARMS/refs/ncbi.db

out/
    Data_result.out
        seq1    94483305        99.4    173     55.4    Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens

class otu.Query_OTU_Fasta_Command.Query_OTU_Fasta_Command(args_)[source]¶

Aligns sequences in a fasta file against those in a reference fasta in order to determine OTU identity.

Inputs:

One or more fasta files containing sequences to identify.
A curated fasta file of high quality sequences and known species.
A two-column, tab-delimited text file mapping sequence names in the curated fasta file to taxonomic identifiers.

Outputs:

A Tax file.

Notes:

The files ‘bold.fna’ and ‘seq_lin.mapping’ are included in the Chewbacca Docker distributions.

Example:

~/ARMS/data/
    bold.fna # A precompiled fasta file of data from BOLD.
        >GBMAA1117-14
        GGGCTTTTGCGGGTATGATAGGAACAGCATTTAGTATGCTTATTAGGTTAGAACTATCTTCCCCAGGGTCTATGTTAGGAGATGATCATTTATATAAT
        GTTATAGTAACAGCTCATGCATTTGTAATGATATTTTTTTTAGTTATGCCAGTAATGATTGGGGGTTTTGGTAATTGGTTAGTACCTTTATATATTGG
        TGCCCCGGATATGGCTTTTCCTAGATTAAATAATATTAGTTTTTGGTTATTACCTCCGGCGCTTACTTTATTATTAGGTTCGGCTTTTGTAGAACAAG
        GGGCTGGGACAGGTTGGACAGTTTATCCGCCTTTATTTAGTATTCAAACTCATTCTGGGGGGTCTGTGGATATGGTAATATTTAGTTTACATTTAGCT
        GGAATATCTTCTATATTAGGGGCTATGAATTTTATAACAACAATCTTTAATATGAGGTCTCCGGGAGTAACTATGGATAGAATGCCTTTATTTGTTTG
        ATCTGTTTTAGTAACTGCTTTTTTATTATTATTATCATTGCCAGTATTAGCTGGTGCCATAACAAGTCTTTTAACCGATCGAGATTTTAATACTACAT
        TT

    seq_lin.mapping # A precompiled two-column tab file of (Taxa) for the entries in 'bold.fna'.
        GBMAA1117-14    Animalia;Porifera;Demospongiae;Haplosclerida;Phloeodictyidae;;Calyx;Calyx podatypa

./
    Data.fasta:
        >seq1
        ACTATCAGGCATTCAAGCCCATTCAGGGGGAGCAGTAGATATGGCTATATTTAGTCTACATCTAGCTGGTGTATCCTCTATTTTAAGTTCTATAAACT
        TTATAACTACTATAATTAATATGAGGGTTCCTGGGATGAGTATGCATAGATTACCTCTATTCGTATGGTCTGTATTAGTTACTACAATATTATTGTTG
        TTATCTTTACCAGTATTAGCTGGTGGAATTACAATGTTATTGACAGATAGAAATTTTAATACAACATTCTTTGACCCTGCGGGAGGAGGAGATCCTAT
        TTTATTCCAGCACTTATTT

$ python chewbacca.py query_fasta -i Data.fasta -o rslt -r ~/ARMS/data/bold.fna -x ~/ARMS/data/seq_lin.mapping

rslt/
        Data_result.out
            seq1        GBMAA1117-14    90.6    265     84.7 Animalia;Porifera;Demospongiae;Haplosclerida;Phloeodictyidae;;Calyx;Calyx podatypa

OTU Annotation¶

class otu.Annotate_OTU_Table_Command.Annotate_OTU_Table_Command(args_)[source]¶

Annotates an OTU table with taxonomic names by replacing sequence names in the OTU table with their identified taxonomies.

Inputs:

An OTU Table to annotate.

One or more Tax files to read annotations from.

Outputs:

An OTU Table with sequence names replaced by taxonomic names in the input .tax file.

Notes:

The input annotation file(s) should list only one identification per sequence name. If you find more than one taxonomic identity for a sequence, choose only one to include in the input .tax file(s).

Example:

./
    matrix.txt
        OTU     Sample1 Sample2 Sample3 Sample4
        seq3    1       2       0       0
        seq1    2       1       1       0

    data.tax:
        seq1    94483305        99.4    173     55.4    Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens

$ python chewbacca.py annotate_matrix -i matrix.txt -a data.tax -o rslt

rslt/
    matrix.txt
        OTU     Sample1 Sample2 Sample3 Sample4
        seq3    1       2       0       0
        Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens  2       1       1       0
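The replacement step in this example can be sketched as a lookup from sequence name to taxonomy. This is an illustration under assumptions drawn from the example only: the sequence name is taken from the first column of the .tax file and the taxonomy from the last, and `annotate_rows` is a hypothetical helper.

```python
def annotate_rows(matrix_rows, tax_rows):
    """Replace OTU names with taxonomies where an identification exists.

    matrix_rows: table rows as lists, header row first.
    tax_rows:    .tax rows as lists; first field is the sequence name,
                 last field is the taxonomy (assumed layout).
    OTUs without an identification keep their sequence name.
    """
    taxonomy = {row[0]: row[-1] for row in tax_rows}
    header, *body = matrix_rows
    return [header] + [[taxonomy.get(row[0], row[0])] + row[1:] for row in body]
```

Because the lookup holds one entry per sequence name, a .tax file with several identifications for the same name would silently keep only the last one, which is why each name should appear once in the input.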