Available Commands¶
Below is a list of the available Chewbacca commands.
Error Correction¶
Note: This functionality is still untested and can over-correct legitimate variation.
class preclean.Preclean_Command.Preclean_Command
Assembling Sequences¶
-
class
assemble.Assemble_Command.
Assemble_Command
(args_)[source]¶ Assembles reads from two (forward and reverse) fastq files/directories. For a set of k forward read files and k reverse read files, returns k assembled files. Matching forward and reverse files should be identically named, except for a <forward>/<reverse> suffix that indicates the read orientation. Two suffix conventions are supported (mixing them is not):
_forwards/_reverse and _R1/_R2
- Inputs:
- fastq file(s) with left reads
- fastq file(s) with right reads
- Outputs:
- fastq File(s) with assembled reads
- Notes:
- Choose ONE suffix style and stick to it! Mixed suffixes are not supported. For example, Sample_100_forwards.fq and Sample_100_reverse.fq will be assembled into Sample_100_assembled.fq. Similarly, Sample_100_R1.fq and Sample_100_R2.fq will be assembled into Sample_100_assembled.fq. However, Sample_100_forwards.fq and Sample_100_R2.fq are not guaranteed to be matched.
- You can provide as many pairs of files as you wish as long as they follow exactly one of the above naming conventions. If a ‘name’ parameter is provided, it will be used as a filename (not path) prefix for all assembled sequence files.
Example
Assuming a forward reads file ‘Data_R1.fq’ and a reverse reads file ‘Data_R2.fq’,
./ Data_R1.fq Data_R2.fq
$ python chewbacca.py assemble -n BALI -f Data_R1.fq -r Data_R2.fq -o rslt
rslt/ BALI_DATA.assembled.fq
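The pairing rule from the notes above can be sketched in a few lines. This is only an illustration, not Chewbacca's own code: the helper name `pair_read_files` is hypothetical, and the sketch assumes the orientation suffix sits immediately before the file extension.

```python
import os

# Supported suffix conventions, as (forward, reverse) pairs.
SUFFIX_STYLES = [("_forwards", "_reverse"), ("_R1", "_R2")]


def pair_read_files(filenames):
    """Pair forward and reverse read files that share a name and suffix style.

    Hypothetical helper: returns a sorted list of (forward, reverse) tuples.
    Files whose mate uses the other suffix style are left unpaired.
    """
    names = set(filenames)
    pairs = []
    for fwd_suffix, rev_suffix in SUFFIX_STYLES:
        for name in names:
            stem, ext = os.path.splitext(name)
            if stem.endswith(fwd_suffix):
                mate = stem[: -len(fwd_suffix)] + rev_suffix + ext
                if mate in names:
                    pairs.append((name, mate))
    return sorted(pairs)


files = [
    "Sample_100_forwards.fq", "Sample_100_reverse.fq",
    "Data_R1.fq", "Data_R2.fq",
    "Sample_7_forwards.fq", "Sample_7_R2.fq",  # mixed styles: not paired
]
for forward, reverse in pair_read_files(files):
    print(forward, reverse)
```

The mixed-style pair is silently skipped here; a real tool might instead report it as an error.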
-
default_program
¶ alias of
Assemble_Program_Pear
Demultiplexing by Barcode¶
-
class
demux.Demux_Barcode_Command.
Demux_Barcode_Command
(args_)[source]¶ Given a set of files, each file is assigned a file offset (value between sampleId and sequenceId). Each file is then split into separate child files, where each child file holds only sequences belonging to a single sample. Each child file is named using the sample name of the sequences it lists and the file offset of the file it came from. Demuxing is based on the nucleotide barcode prefixing each sequence.
- Inputs:
- One or more fasta/fastq files to demux.
- A single .barcodes file.
- Outputs:
- <sample_name>_<file_id#>_demux.<ext> file(s) - <fasta/fastq> files, containing all the sequences from file <file_id#>, which had a barcode corresponding to sample <sample_name>.
- unmatched_<file_id#>_demux.<ext> file(s) - <fasta/fastq> files, containing sequences from file <file_id#>, whose barcode did not match any of those listed in the .barcodes file.
- Notes:
- The assignment of the offset to a file should be treated as an arbitrary process and should not be used for record keeping.
- Each input file will generate its own unmatched_* file (if applicable).
Example:
data/
    Data1.fasta:
        @Seq4
        AGACGCAAAAAA
        @Seq5
        AGTGTAAAAAAT
    Data2.fasta:
        @Seq6
        AGACGCAAAAAC
        @Seq7
        AGTGTAAAAAAG
        @Seq8
        CGTGTAAAAAAG
./
    Data.barcodes:
        SampleA AGACGC
        SampleB AGTGTA
$ python chewbacca.py demux_samples -i data/ -b Data.barcodes -o rslt
Here, we see that Data1.fasta was assigned ‘0’ as an offset, while Data2.fasta was assigned ‘1’ as an offset. Because both files had sequences from SampleA, the sequences from Data1.fasta were written to SampleA_0_demux.fastq, and those sequences from Data2.fasta were written to SampleA_1_demux.fastq. The same is true for SampleB.
rslt/
    SampleA_0_demux.fastq:
        @Seq4
        AGACGCAAAAAA
    SampleB_0_demux.fastq:
        @Seq5
        AGTGTAAAAAAT
    SampleA_1_demux.fastq:
        @Seq6
        AGACGCAAAAAC
    SampleB_1_demux.fastq:
        @Seq7
        AGTGTAAAAAAG
rslt_aux/
    unmatched_0_demux.fastq:
        @Seq8
        CGTGTAAAAAAG
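The barcode assignment in this example can be pictured with the short sketch below. It assumes exact prefix matching with no mismatches allowed, which may differ from what the default program does; the function name `demux_by_barcode` is hypothetical.

```python
def demux_by_barcode(records, barcodes):
    """Group (name, sequence) records by the barcode prefixing each sequence.

    `barcodes` maps sample name -> barcode. Records whose sequence starts
    with no known barcode are collected under the key "unmatched".
    Assumption: exact prefix match only.
    """
    bins = {}
    for name, seq in records:
        sample = "unmatched"
        for candidate, barcode in barcodes.items():
            if seq.startswith(barcode):
                sample = candidate
                break
        bins.setdefault(sample, []).append((name, seq))
    return bins


barcodes = {"SampleA": "AGACGC", "SampleB": "AGTGTA"}
data2 = [
    ("Seq6", "AGACGCAAAAAC"),
    ("Seq7", "AGTGTAAAAAAG"),
    ("Seq8", "CGTGTAAAAAAG"),
]
for sample, recs in demux_by_barcode(data2, barcodes).items():
    print(sample, recs)
```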
-
default_program
¶ alias of
Demux_Program_Fastx
Demultiplexing by Name¶
-
class
demux.Demux_Name_Command.
Demux_Name_Command
(args_)[source]¶ Given a set of files, each file is assigned a file offset. Each file is then split into separate child files, where each child file holds only sequences belonging to a single sample. Each child file is named using the sample name of the sequences it lists and the file offset of the file it came from. Demuxing is based on unique sample names contained in sequence names.
- Inputs:
- One or more fasta/fastq files to demux. Sequences in these files should contain as a prefix the sample they came from. (This is untested)
- A single .barcodes file, listing samples as they appear in sequence names. The barcode sequences themselves can be made up, because this command only uses the barcode names.
- Outputs:
- <sample_name>_<offset>_demux.<ext> file(s) - <fasta/fastq> files, containing all the sequences from file <offset>, which had a sequence name containing sample <sample_name>.
- unmatched_<offset>_demux.<ext> file(s) - <fasta/fastq> files, containing sequences from file <offset>, whose sequence name did not match any of the samples listed in the .barcodes file.
- Notes:
- The assignment of the offset to a file should be treated as an arbitrary process and should not be used for record keeping.
- Each input file will generate its own unmatched_* file (if applicable).
Example:
data/
    Data1.fasta:
        @SampleA:001
        AAAAAAAAAAAA
        @SampleAA:002
        AAAAAAAAAAAT
        @SampleA1:003
        AAAAAAAAAAAC
        @Sample_B:001
        AAAAAAAAAAAG
    Data2.fasta:
        @SampleAA:001
        GAAAAAAAAAAA
        @SampleA:002
        TAAAAAAAAAAA
        @Seq8
        CAAAAAAAAAAA
./
    Data.barcodes:
        SampleA AAA
        SampleAA AAA
        Sample_B AAA
$ python chewbacca.py demux_names -i data/ -b Data.barcodes -o rslt
Here, we see that Data1.fasta was assigned ‘0’ as an offset, while Data2.fasta was assigned ‘1’ as an offset. Because both files had sequences from SampleA, the sequences from Data1.fasta were written to SampleA_0_demux.fastq, and those sequences from Data2.fasta were written to SampleA_1_demux.fastq. The same is true for SampleAA.
rslt/
    SampleA_0_demux.fastq:
        @SampleA:001
        AAAAAAAAAAAA
        @SampleA1:003
        AAAAAAAAAAAC
    SampleAA_0_demux.fastq:
        @SampleAA:002
        AAAAAAAAAAAT
    SampleB_0_demux.fastq:
        @Sample_B:001
        AAAAAAAAAAAG
    SampleA_1_demux.fastq:
        @SampleA:002
        TAAAAAAAAAAA
    SampleAA_1_demux.fastq:
        @SampleAA:001
        GAAAAAAAAAAA
rslt_aux/
    unmatched_1_demux.fastq:
        @Seq8
        CAAAAAAAAAAA
-
default_program
¶ alias of
Demux_Program_Chewbacca
Sequence Renaming¶
-
class
rename.Rename_Command.
Rename_Command
(args_)[source]¶ Renames sequences in a file with their sampleID and a serial ID#. Useful for simplifying complex naming systems into human-readable sequence names. In order to ensure the correct sample names are preserved, it is recommended that this command be run immediately after the Demux Command.
- Inputs:
- A single fasta/fastq file or a directory containing multiple fasta/fastq files.
- Outputs:
- _renamed.<ext> file - A <fasta/fastq> file with the renamed sequences.
- .samples file - Lists each renamed sequence alongside the name of its sample.
- .mapping file - Lists each original sequence name alongside its new name.
- Notes:
- In order for the .samples file to correctly list the sample name of the sequences in a file, this command should be run immediately after the Demux Command.
- The --clip parameter tells Chewbacca that the trailing _<offset number> (from the demuxing command) should not be considered part of the sample name when naming sequences. By default this is set to True, which should be fine in most cases. If you notice parts of your sample names getting clipped off in your .samples file, explicitly set this parameter to False.
- Each input file will have a corresponding .samples, .mapping, and _renamed file.
- The .samples file is needed by downstream Chewbacca processes (Building the OTU Table).
- The .mapping file is purely for user convenience and record-keeping.
Example:
SampleA_0.fasta:
    @M03292:26:000000000-AH6AG:1:1101:22127:1256
    AAAA
    @M03292:26:000000000-AH6AG:1:1101:22127:1257
    AAAT
$ python chewbacca.py rename -i SampleA_0.fasta -o rslt
rslt/SampleA_0_renamed.fasta:
    @SampleA_ID0
    AAAA
    @SampleA_ID1
    AAAT
rslt_samples/SampleA_0_renamed.samples:
    SampleA_ID0 SampleA
    SampleA_ID1 SampleA
rslt_aux/SampleA_0_renamed.mapping:
    M03292:26:000000000-AH6AG:1:1101:22127:1256 SampleA_ID0
    M03292:26:000000000-AH6AG:1:1101:22127:1257 SampleA_ID1
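The renaming scheme and the effect of clipping can be sketched as follows. The helper `rename_sequences` is hypothetical, and the sketch assumes the sample name is taken from the input file name, with a trailing _<digits> offset removed when clipping is on.

```python
import re


def rename_sequences(file_stem, names, clip=True):
    """Rename sequences as <sample>_ID<n> and build .samples/.mapping rows.

    Hypothetical helper. `file_stem` is the input file name without its
    extension, e.g. "SampleA_0". With clip=True, a trailing _<digits>
    offset is dropped to recover the sample name.
    """
    sample = re.sub(r"_\d+$", "", file_stem) if clip else file_stem
    samples, mapping = [], []
    for serial, old_name in enumerate(names):
        new_name = f"{sample}_ID{serial}"
        samples.append((new_name, sample))    # rows of the .samples file
        mapping.append((old_name, new_name))  # rows of the .mapping file
    return samples, mapping


old_names = [
    "M03292:26:000000000-AH6AG:1:1101:22127:1256",
    "M03292:26:000000000-AH6AG:1:1101:22127:1257",
]
samples, mapping = rename_sequences("SampleA_0", old_names)
print(samples)
print(mapping)
```

With clipping disabled, the same input would yield names such as SampleA_0_ID0, which is the behaviour to reach for when a sample name legitimately ends in an underscore followed by digits.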
Adapter Removal¶
-
class
clean.Clean_Adapters_Command.
Clean_Adapters_Command
(args_)[source]¶ Removes sequencing adapters (and preceding barcodes) from sequences in input file(s). Sequences should be in the following format:
<BARCODE><ADAPTER><SEQUENCE><ADAPTER_RC>.
Valid ADAPTER sequences, and their reverse-complements (ADAPTER_RC), should be defined separately in a pair of fasta-formatted files. Sequences passed to this command should have already been demultiplexed, as this process will remove the identifying barcode sequences.
- Inputs:
- One or more fasta/fastq files to clean.
- A single .adapters file
- A single .adapters_RC file
- Outputs:
- <filename>_debarcoded.<ext> file(s) - <fasta/fastq> files, containing sequences with their leading adapters, trailing adapters, and barcodes removed.
- Notes:
- Be aware of the program-specific details around ‘N’ nucleotide characters.
Example:
Given Data_ID#1 with barcode=AGACGC:
./
    Data.fasta:
        @Data_ID#1
        AGACGCGGWACWGGWTGAACWGTWTAYCCYCCATCGATCGATCGTGRTTYTTYGGNCAYCCNGARGTNTA
    Data.adapters:
        >1
        GGWACWGGWTGAACWGTWTAYCCYCC
    Data.adapters_RC:
        >first
        TGRTTYTTYGGNCAYCCNGARGTNTA
$ python chewbacca.py trim_adapters -i Data.fasta -o rslt -a Data.adapters -arc Data.adapters_RC
rslt/ Data_debarcoded.fastq: @Data_ID#1 CATCGATCGATCG
-
default_program
¶ alias of
Clean_Adapters_Program_Flexbar
Quality Cleaning¶
-
class
clean.Clean_Quality_Command.
Clean_Quality_Command
(args_)[source]¶ Removes regions of low quality from fastq-formatted reads. These regions are likely sources of error, and would be detrimental to other analytical processes. Input sequences to this command should have already been demultiplexed, and had their barcodes/adapters removed. Otherwise, the partial removal of these markers would leave behind invalid partial fragments that would be difficult to detect, demultiplex, or trim.
- Inputs:
- One or more fastq files to clean.
- Outputs:
- <filename>_cleaned.fastq file(s) - Fastq files, containing sequences with areas of low quality removed.
- Notes:
- Be aware of the program-specific details around ‘N’ nucleotide characters.
- Be aware of the program-specific defaults for minimum surviving sequence lengths.
Example:
./ Data.fasta: @Data_ID#1 GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCTTTACAG + !zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz%%%zzzz
The command below asks Chewbacca to trim away any section of length 3 NT in Data_ID#1 that has quality lower than 20, keeping the longer of the remaining ends. If the remaining sequence at the end of this process is shorter than 15 NT, discard the whole sequence (these values are chosen for illustrative purposes).
$ python chewbacca.py clean_seqs -i Data.fasta -o rslt -m 15 -w 3 -q 20
Note that the ‘TTT’ subsequence has been cut, because its average quality (5) is less than the threshold (20). After this cut, the longest remaining subsequence (the subsequence to the left of the cut) was kept, and the shorter subsequence (to the right of the cut) was discarded. Because the final sequence is longer than 15NT, it is kept and written to the output file.
rslt/ Data_cleaned.fastq: @Data_ID#1 GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTC + !zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
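The trimming behaviour described in this example can be sketched as below. It mirrors only the description given here (cut a low-quality window, keep the longer flank, drop short survivors) and is not a reimplementation of the underlying trimming program. The function name `trim_low_quality` and the Phred+33 quality encoding are assumptions.

```python
def trim_low_quality(seq, qual, window=3, min_quality=20, min_length=15):
    """Cut out low-quality windows, keeping the longer flank each time.

    Assumes Phred+33 quality encoding. Returns (seq, qual), or None if the
    surviving read is shorter than min_length.
    """
    while True:
        scores = [ord(c) - 33 for c in qual]
        cut = None
        # Find the first window whose mean quality is below the threshold.
        for start in range(len(seq) - window + 1):
            if sum(scores[start:start + window]) / window < min_quality:
                cut = start
                break
        if cut is None:
            break
        left = (seq[:cut], qual[:cut])
        right = (seq[cut + window:], qual[cut + window:])
        # Keep the longer of the two remaining ends and scan it again.
        seq, qual = left if len(left[0]) >= len(right[0]) else right
    return (seq, qual) if len(seq) >= min_length else None


read = "ACGTACGTACGTACGTACGT" + "TTT" + "ACAG"
quality = "z" * 20 + "%%%" + "zzzz"
print(trim_low_quality(read, quality, window=3, min_quality=20, min_length=15))
```

Here the three low-quality bases and the short tail after them are discarded, leaving the 20-base left flank.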
-
default_program
¶ alias of
Clean_Quality_Program_Trimmomatic
File Conversion¶
-
class
util.Convert_Fastq_Fasta_Command.
Convert_Fastq_Fasta_Command
(args_)[source]¶ Converts a Fastq-formatted file to a Fasta-formatted file. Useful for reducing data size and preparing for fasta-only operations.
- Inputs:
- A fastq file or a directory containing multiple fastq files.
- Outputs:
- <filename>.fasta file(s) - Converted fasta files.
Example:
./ Data.fastq: @Data_ID#1 AGACGCGGWACWGGWTGAACWGTWTAYCCYCCATCGATCGATCGTGRTTYTTYGGNCAYCCNGARGTNTA
$ python chewbacca.py convert_fastq_to_fasta -i Data.fastq -o rslt
rslt/ Data.fasta: >Data_ID#1 AGACGCGGWACWGGWTGAACWGTWTAYCCYCCATCGATCGATCGTGRTTYTTYGGNCAYCCNGARGTNTA
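The conversion itself amounts to keeping the header and sequence lines and discarding the quality information. A minimal sketch, assuming unwrapped four-line fastq records (the function name `fastq_to_fasta` is hypothetical):

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA text.

    Assumes each record is exactly four lines (header, sequence, '+',
    quality), i.e. sequences are not wrapped across lines.
    """
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    out = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        out.append(">" + header[1:])  # '@name' becomes '>name'
        out.append(sequence)          # '+' and quality lines are dropped
    return "\n".join(out) + "\n"


print(fastq_to_fasta("@Data_ID1\nGATTTGGGG\n+\n!zzzzzzzz\n"), end="")
```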
Dereplication¶
-
class
dereplicate.Dereplicate_Command.
Dereplicate_Command
(args_)[source]¶ Dereplicates a fasta file by grouping identical reads together under one representative sequence. The number of duplicate/seed sequences each representative sequence represents is given by a ‘replication count’ at the end of the sequence name in the output fasta file. If a .groups file is provided, then previous replication counts will be taken into account. For example, imagine a representative sequence X that represents 3 sequences. If X is found to be identical to Y (which is not a seed for any other sequence), then the new cardinality, or replication count, of X becomes 4. Cardinalities are denoted with a suffix of ‘_K’ on the sequence name, where K is the cardinality of the group that sequence represents.
- Inputs:
- One or more fasta files to dereplicate.
- Optional: .groups - A list of representative names and the names of their seed sequences. You likely have one of these files if you’ve previously run a clustering or dereplication command.
- Outputs:
- _counts.fasta file - A fasta file with unique sequences and their replication counts.
- _derep.groups file - A list of representative names and the names of their seed sequences.
- Notes:
- This command only dereplicates within each fasta file (not across all files). This means a sequence in one file will be unique within that file, but might exist in another file. To ensure sequences are unique across an entire dataset, merge all fasta files into one file, then dereplicate that fasta file. If the fasta files each have .groups files, then make sure you merge those as well.
- Each input file will generate a corresponding _counts.fasta file.
- If an input .groups file is not provided, then each input fasta file will generate a new groups file named <file_name>_derep.groups. If an input .groups file IS provided, then a single groups file named ‘dereplicated_updated.groups’ will be generated.
- The output .groups file is needed by downstream Chewbacca processes (Dereplication, Clustering, Building the OTU Table).
- The order of sequence names in the *_counts.fasta and .groups file is arbitrary.
Example:
./
    Data.fasta
        >seq1
        AAA
        >seq2
        AAA
        >seq3
        AAAG
        >seq4_3
        AAAGT
        >seq7
        AAAGT
    test.groups
        seq4 seq4 seq5 seq6
In the above example, test.groups indicates that seq4 is a sequence that has previously been identified as a representative (in some earlier round of clustering or dereplication).
$ python chewbacca.py dereplicate_fasta -i Data.fasta -o rslt -g test.groups
rslt/Data_counts.fasta:
    >seq4_4
    AAAGT
    >seq1_2
    AAA
    >seq3_1
    AAAG
rslt_groups_files/*.groups:
    seq3 seq3
    seq1 seq2 seq1
    seq4 seq7 seq6 seq5 seq4
Notice that Data_counts.fasta lists the unique sequences from Data.fasta, and their replication counts. Also notice that seq4 had previous replication data (stored in the test.groups file).
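The way earlier replication data is carried forward can be sketched as follows. This is an illustration rather than the method used by the default program: the helper `dereplicate` is hypothetical, and it assumes the first name seen for a sequence becomes the representative and that a trailing _<count> is stripped only when the remaining name appears in the supplied groups.

```python
def dereplicate(records, groups=None):
    """Collapse identical sequences and update replication counts.

    `records` is a list of (name, sequence). `groups` maps a representative
    name to the names it already represents (including itself).
    Returns (new_groups, counts).
    """
    groups = groups or {}
    members_by_seq = {}
    for name, seq in records:
        # 'seq4_3' refers to representative 'seq4' if that group is known.
        base = name.rsplit("_", 1)[0]
        if base not in groups:
            base = name
        members_by_seq.setdefault(seq, []).extend(groups.get(base, [base]))
    new_groups = {members[0]: members for members in members_by_seq.values()}
    counts = {rep: len(members) for rep, members in new_groups.items()}
    return new_groups, counts


records = [
    ("seq1", "AAA"), ("seq2", "AAA"), ("seq3", "AAAG"),
    ("seq4_3", "AAAGT"), ("seq7", "AAAGT"),
]
new_groups, counts = dereplicate(records, {"seq4": ["seq4", "seq5", "seq6"]})
for rep, count in counts.items():
    print(f">{rep}_{count}")
```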
-
default_program
¶ alias of
Dereplicate_Program_Vsearch
File Splitting¶
-
class
util.Partition_Command.
Partition_Command
(args_)[source]¶ A utility command that partitions a fasta/fastq file into a set of files (of the same file format), with a user-specified (maximum) number of sequences per file. Allows users to partition a large file into segments, and perform discrete operations in run_parallel over those segments.
- Inputs:
- One or more fasta/fastq files to partition.
- C: An integer defining the maximum number of sequences per file
- Outputs:
- <filename>_part_<part_#>.<ext> file(s) - <fasta/fastq> files, with at most C sequences per file.
Example:
./ Data.fq: @Data_ID1 GATTTGGGG + !zzzzzzzzz @Data_ID2 GATTTGGGG + !zzzzzzzzz @Data_ID3 GATTTGGGG + !zzzzzzzzz
$ python chewbacca.py convert_fastq_to_fasta -i Data.fq -o rslt/
rslt/ Data.fasta: @Data_ID1 GATTTGGGG @Data_ID2 GATTTGGGG @Data_ID3 GATTTGGGG
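The partitioning described for this command, at most C sequences per output file, can be sketched as below. The function name `partition` is hypothetical, and the records are represented as plain list items rather than parsed fasta/fastq entries.

```python
def partition(records, chunk_size):
    """Split `records` into consecutive chunks of at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


reads = ["Data_ID1", "Data_ID2", "Data_ID3"]
for part_number, chunk in enumerate(partition(reads, 2)):
    print(f"Data_part_{part_number}:", chunk)
```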
File Merging¶
-
class
util.Merge_Command.
Merge_Command
(args_)[source]¶ Concatenates multiple files into a single file. Useful for combining the results of a run_parallel operation, or when preparing for cross-sample dereplication.
- Inputs:
- A set of files to merge.
- An <output_filename>.
- An <output_prefix>.
- Outputs:
- <output_filename>.<output_prefix> - A file consisting of all the input files concatenated together.
- Notes:
- The order of the content in the concatenated files is not guaranteed.
Example:
targets/ Data.fq: @Data_ID1 GATTTGGGG + !zzzzzzzzz Data2.fa: @Data_ID1 GATTTGGGG Blah.txt Hello World!
$ python chewbacca.py merge_files -i targets/ -o rslt/ -f txt -n Merged
rslt/ Merged.txt: Hello World! @Data_ID1 GATTTGGGG + !zzzzzzzzz @Data_ID1 GATTTGGGG
File Cleaning¶
-
class
util.Ungap_Command.
Ungap_Command
(args_)[source]¶ Removes target characters from a fasta/fastq file. Useful for removing gap characters from sequence alignments.
- Inputs:
- One or more fasta/fastq files to clean.
- A string of one or more gap characters to remove.
- Outputs:
- *_cleaned.<ext> file - A <fasta/fastq> file with gap characters removed from its sequences.
Example:
Data.fasta: >seq1 AAAAA.A*A-A
$ python chewbacca.py ungap_fasta -i Data.fasta -o rslt -f fasta -g ".*-"
rslt/Data.fasta: >seq1 AAAAAAAA
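Removing a set of gap characters from a sequence is a one-line operation in Python. The sketch below uses a hypothetical helper name, `ungap`, and applies it to the sequence from the example.

```python
def ungap(sequence, gap_chars):
    """Return `sequence` with every character in `gap_chars` removed."""
    return sequence.translate(str.maketrans("", "", gap_chars))


print(ungap("AAAAA.A*A-A", ".*-"))  # AAAAAAAA
```

In a real fasta or fastq file, only the sequence lines should be passed through such a function, since header lines may legitimately contain these characters.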
Deep Cleaning¶
-
class
clean.Clean_Deep_Command.
Clean_Deep_Command
(args_)[source]¶ Performs an intensive deep-cleaning of sequences to eliminate frameshifts, detect chimeras, and determine sequence orientation. Input files to this command should first be dereplicated. Doing so will reduce the total number of alignments required, and reduce computation time.
- Inputs:
- One or more fasta/fastq files to deep clean (nucleotide sequences).
- One reference fasta (nucleotide sequences).
- Outputs:
- *_AA - Amino Acid Alignment file, including reference sequences.
- *_log.csv - A log listing each input sequence, and deep cleaning results for each sequence.
- *_NT - Nucleotide Alignment file, including reference sequences.
- Notes:
- Sequences that do not meet quality cleaning standards are dropped.
- The output files contain reference sequences, and odd alignment characters. Both of these need to be removed by running the Clean_Deep_Repair Command.
Example:
Data.fasta BIOCODE.fa
$ python chewbacca.py macseAlign -i Data.fasta -o rslt -d BIOCODE.fa
rslt/Data_AA rslt/Data_NT rslt/Data_log.csv
-
default_program
¶ alias of
Clean_Deep_Program_Macse
Deep Cleaning Repair¶
-
class
clean.Clean_Deep_Repair_Command.
Clean_Deep_Repair_Command
(args_)[source]¶ Cleans aligned files by removing gap characters and reference sequences from the file. Sequences passed to this command should have previously been aligned.
- Inputs:
- *_AA - Amino Acid Alignment file, including reference sequences.
- *_log.csv - A log listing each input sequence, and deep cleaning results for each sequence.
- *_NT - Nucleotide Alignment file, including reference sequences.
- Nucleotide reference fasta.
- The original fasta files that were passed in to the Clean_Deep Command.
- The Nucleotide reference fasta that was passed to the Clean_Deep Command.
- Outputs:
- *_MERGED.fasta - A clean fasta file with all the surviving sequences from deep cleaning.
- Notes:
- A single *_MERGED.fasta is generated regardless of the number of input files.
Example:
BIOCODE.fa originalData/Data.fasta input/ Data_AA Data_NT Data_log.csv
$ python chewbacca.py -i input/ -o out/ -d BIOCODE.fa -s originalData/
out/ MACSE_OUT_MERGED.fasta
-
default_program
¶ alias of
Clean_Deep_Repair_Program_Macse
Sequence Clustering¶
-
class
cluster.Cluster_Command.
Cluster_Command
(args_)[source]¶ Clusters a set of fasta files. This command generates a fasta file of unique sequences (each representing a cluster) and a .groups file. This command also takes an optional .groups file containing replication data from previous commands. If a .groups file is supplied, only one output .groups file is generated (regardless of the number of inputs).
- Inputs:
- One or more fasta files to cluster.
- Optional: .groups - A list of representative names and the names of their seed sequences. You likely have one of these files if you’ve previously run a clustering or dereplication command.
- Outputs:
- *.fasta file - A fasta file with unique sequences and their replication counts.
- *.groups - A .groups file listing representative names and the names of their seed sequences.
- Notes:
- The input fasta file(s) should have been dereplicated before clustering.
- For a single experiment with multiple fasta files, it is best to merge all input fasta files, dereplicate them, then cluster the single merged and dereplicated fasta file. This provides the best OTU groupings.
Example:
./
    Data.fasta:
        >seq1_3
        AAAAAAAAAA
        >seq2_1
        ATAAAAAAAA
        >seq3_1
        TTTTTTTTTT
        >seq4_1
        TTTTTTATTT
        >seq5_1
        TTTTTTATCT
    Data.groups:
        seq1 seq6 seq1 seq7
$ python chewbacca.py cluster_seqs -i Data.fasta -o rslt -g Data.groups
rslt/
    Data_clustered_seeds.fasta:
        >seq1_4
        AAAAAAAAAA
        >seq3_3
        TTTTTTTTTT
rslt_groups_files/
    postcluster_updated.groups:
        seq3 seq3 seq5 seq4
        seq1 seq2 seq1 seq7 seq6
OTU Table Construction¶
-
class
otu.Build_OTU_Table_Command.
Build_OTU_Table_Command
(args_)[source]¶ Builds an OTU table using a .groups, .samples, and .barcodes file. The OTU table shows OTU (group) abundance by sample.
- Inputs:
- One or more .samples files.
- One or more .barcodes files.
- One or more .groups files.
- Outputs:
- matrix.txt - A tab-delimited table mapping OTUs (groups) to their abundance in each sample.
- Notes:
- A sequence name may not appear in more than one groups file (or on more than one line of a groups file, for that matter!).
Example:
./
    test.barcodes
        Sample1 aaaaaa
        Sample2 aaaaat
        Sample3 aaaaac
        Sample4 aaaaag
    test.groups
        seq3 seq3 seq5 seq4
        seq1 seq2 seq1 seq7 seq6
    test.samples
        seq1 Sample1
        seq2 Sample1
        seq3 Sample1
        seq4 Sample2
        seq5 Sample2
        seq6 Sample2
        seq7 Sample3
$ python chewbacca.py build_matrix -b test.barcodes -g test.groups -s test.samples -o rslt/
rslt/
    matrix.txt
        OTU     Sample1  Sample2  Sample3  Sample4
        seq3    1        2        0        0
        seq1    2        1        1        0
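The counting behind this table can be sketched as follows. The function name `build_otu_table` is hypothetical, and the sketch assumes each line of a .groups file gives the representative first, followed by the members of its group.

```python
def build_otu_table(groups, samples, sample_names):
    """Count, for each OTU, how many member sequences came from each sample.

    `groups` maps OTU representative -> member sequence names,
    `samples` maps sequence name -> sample name,
    `sample_names` fixes the column order of the table.
    """
    table = {}
    for otu, members in groups.items():
        row = dict.fromkeys(sample_names, 0)
        for member in members:
            row[samples[member]] += 1
        table[otu] = [row[name] for name in sample_names]
    return table


groups = {
    "seq3": ["seq3", "seq5", "seq4"],
    "seq1": ["seq2", "seq1", "seq7", "seq6"],
}
samples = {
    "seq1": "Sample1", "seq2": "Sample1", "seq3": "Sample1",
    "seq4": "Sample2", "seq5": "Sample2", "seq6": "Sample2",
    "seq7": "Sample3",
}
sample_names = ["Sample1", "Sample2", "Sample3", "Sample4"]
for otu, counts in build_otu_table(groups, samples, sample_names).items():
    print(otu, *counts, sep="\t")
```

Sample4 is listed in the .barcodes file but has no sequences, which is why its column is all zeros.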
OTU Identification¶
-
class
otu.Query_OTU_DB_Command.
Query_OTU_DB_Command
(args_)[source]¶ Aligns sequences in a fasta file against those in a reference database in order to determine OTU identity.
Only alignment-based identification using vsearch is currently available.
- Inputs:
- One or more fasta files containing sequences to identify.
- A curated fasta file of high quality sequences and known species.
- A database containing taxonomic identifiers for sequences in the curated fasta file.
- Outputs:
- A .tax file.
- Notes:
- The files COI.fasta and ncbi.db are included in the Chewbacca Docker distributions.
Example:
~/ARMS/refs/ COI.fasta # A precompiled fasta file of COI data from NCBI. >94483305 AGGACGGATCAGACGAAGAGGGGCGTTTGGTATTGGGTTATGGCAGGGGGTTTTATATTGATAATTGTTGTGATGAAATT GATGGCCCCTAAGATAGAGGAGACACCTGCTAGGTGTAAGGAGAAGATGGTTAGGTCTACGGAGGCTCCAGGGTGGGAGT ncbi.db # A precompiled database of (Taxa) for the entries in 'COI.fasta'. data/ Data.fasta: >seq1 GAATAGGTGTTGGTATAGAATGGGGTCTCCTCCTCCGGCGGGGTCGAAGAAGGTGGTGTTGAGGTTGCGGTCTGTTAGTAGTATAGTGATGCCAGCAG CTAGGACTGGGAGAGATAGGAGAAGTAGGACTGCTGTGATTAGGACGGATCAGACGAAGAGGGGCGTTTGGTATTGGGTTATGGCAGGGGGTTTTATA TTGATAATTGTTGTGAGGAAATTGATGGCCCCTAAGATAGAGGAGACACCTGCTAGGTGTAAGGAGAAGATGGTTAGGTCTACGGAGGCTCCAGGGTG GGAGTAGTTCCCTGCTAA
$ python chewbacca.py query_db -i Data.fasta -o rslt -r ~/ARMS/refs/COI.fasta -d ~/ARMS/refs/ncbi.db
rslt/ Data_result.out seq1 94483305 99.4 173 55.4 Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens
-
class
otu.Query_OTU_Fasta_Command.
Query_OTU_Fasta_Command
(args_)[source]¶ Aligns sequences in a fasta file against those in a reference fasta in order to determine OTU identity.
- Inputs:
- One or more fasta files containing sequences to identify.
- A curated fasta file of high quality sequences and known species.
- A two-column, tab-delimited text file mapping sequence names in the curated fasta file to taxonomic identifiers.
- Outputs:
- A .tax file.
- Notes:
- The files ‘bold.fna’ and ‘seq_lin.mapping’ are included in the Chewbacca Docker distributions.
Example:
~/ARMS/data/ bold.fna # A precompiled fasta file of data from BOLD. >GBMAA1117-14 GGGCTTTTGCGGGTATGATAGGAACAGCATTTAGTATGCTTATTAGGTTAGAACTATCTTCCCCAGGGTCTATGTTAGGAGATGATCATTTATATAAT GTTATAGTAACAGCTCATGCATTTGTAATGATATTTTTTTTAGTTATGCCAGTAATGATTGGGGGTTTTGGTAATTGGTTAGTACCTTTATATATTGG TGCCCCGGATATGGCTTTTCCTAGATTAAATAATATTAGTTTTTGGTTATTACCTCCGGCGCTTACTTTATTATTAGGTTCGGCTTTTGTAGAACAAG GGGCTGGGACAGGTTGGACAGTTTATCCGCCTTTATTTAGTATTCAAACTCATTCTGGGGGGTCTGTGGATATGGTAATATTTAGTTTACATTTAGCT GGAATATCTTCTATATTAGGGGCTATGAATTTTATAACAACAATCTTTAATATGAGGTCTCCGGGAGTAACTATGGATAGAATGCCTTTATTTGTTTG ATCTGTTTTAGTAACTGCTTTTTTATTATTATTATCATTGCCAGTATTAGCTGGTGCCATAACAAGTCTTTTAACCGATCGAGATTTTAATACTACAT TT seq_lin.mapping # A precompiled two-column tab file of (Taxa) for the entries in 'bold.fna'. GBMAA1117-14 Animalia;Porifera;Demospongiae;Haplosclerida;Phloeodictyidae;;Calyx;Calyx podatypa ./ Data.fasta: >seq1 ACTATCAGGCATTCAAGCCCATTCAGGGGGAGCAGTAGATATGGCTATATTTAGTCTACATCTAGCTGGTGTATCCTCTATTTTAAGTTCTATAAACT TTATAACTACTATAATTAATATGAGGGTTCCTGGGATGAGTATGCATAGATTACCTCTATTCGTATGGTCTGTATTAGTTACTACAATATTATTGTTG TTATCTTTACCAGTATTAGCTGGTGGAATTACAATGTTATTGACAGATAGAAATTTTAATACAACATTCTTTGACCCTGCGGGAGGAGGAGATCCTAT TTTATTCCAGCACTTATTT
$ python chewbacca.py query_fasta -i Data.fasta -o rslt -r ~/ARMS/data/bold.fna -x ~/ARMS/data/seq_lin.mapping
rslt/ Data_result.out seq1 GBMAA1117-14 90.6 265 84.7 Animalia;Porifera;Demospongiae;Haplosclerida;Phloeodictyidae;;Calyx;Calyx podatypa
OTU Annotation¶
-
class
otu.Annotate_OTU_Table_Command.
Annotate_OTU_Table_Command
(args_)[source]¶ Annotates an OTU table with taxonomic names by replacing sequence names in the OTU table with their identified taxonomies. Multiple OTUs can be annotated with the same taxonomic name; those are not combined.
- Inputs:
- An OTU_table to annotate.
- One or more .tax files to read annotations from.
- Outputs:
- An OTU_table with sequence names replaced by taxonomic names in the input .tax file.
- Notes:
- The input annotation file(s) should list only one identification per sequence name. If you find more than one taxonomic identity for a sequence, choose only one to include in the input .tax file(s).
Example:
./
    matrix.txt
        OTU     Sample1  Sample2  Sample3  Sample4
        seq3    1        2        0        0
        seq1    2        1        1        0
    data.tax:
        seq1 94483305 99.4 173 55.4 Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens
$ python chewbacca.py annotate_matrix -i matrix.txt -a data.tax -o rslt
rslt/
    matrix.txt
        OTU     Sample1  Sample2  Sample3  Sample4
        seq3    1        2        0        0
        Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens    2        1        1        0
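The substitution shown in this example can be sketched in a few lines. Both helper names are hypothetical, and the parser assumes .tax lines are tab-delimited with the sequence name in the first column and the taxonomy in the last, as the example rows suggest.

```python
def read_tax(lines):
    """Map sequence name -> taxonomic name from .tax lines.

    Assumes tab-delimited lines: sequence name first, taxonomy last.
    """
    taxonomy = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            taxonomy[fields[0]] = fields[-1]
    return taxonomy


def annotate_rows(rows, taxonomy):
    """Replace each OTU name with its taxonomic name when one is known.

    OTUs without an annotation keep their sequence name, and rows that
    end up sharing a taxonomic name are not combined.
    """
    return [(taxonomy.get(name, name), counts) for name, counts in rows]


tax_lines = [
    "seq1\t94483305\t99.4\t173\t55.4\t"
    "Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens\n"
]
rows = [("seq3", [1, 2, 0, 0]), ("seq1", [2, 1, 1, 0])]
for name, counts in annotate_rows(rows, read_tax(tax_lines)):
    print(name, *counts, sep="\t")
```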