Source code for demux.Demux_Name_Command

from classes.ChewbaccaCommand import ChewbaccaCommand
from Demux_Name_Program_Chewbacca import Demux_Program_Chewbacca
from Demux_Barcode_Program_Fastx import Demux_Program_Fastx


class Demux_Name_Command(ChewbaccaCommand):
    """Given a set of files, each file is assigned a file offset.  Each file is then split into separate child files,
    each holding only sequences belonging to a single sample.  Each child file is named using the sample name of the
    sequences it lists and the offset of the file it came from.  Demuxing is based on unique sample names contained
    in sequence names.

    **Inputs**:
        * One or more fasta/fastq files to demux.  Sequence names in these files should contain, as a prefix, the \
            sample they came from. (This is untested)
        * A single :ref:`.barcodes` file, listing samples as they appear in sequence names.  The barcode sequences \
            themselves can be made up, because this command only makes use of barcode names.

    **Outputs**:
        * <sample_name>_<offset>_demux.<ext> file(s) - <fasta/fastq> files, containing all the sequences from the \
            file assigned <offset> whose sequence name contained sample <sample_name>.
        * unmatched_<offset>_demux.<ext> file(s) - <fasta/fastq> files, containing the sequences from the file \
            assigned <offset> whose names did not match any sample listed in the .barcodes file.

    **Notes**:
        * The assignment of offset to file should be treated as an arbitrary process and should not be used for \
            record keeping.
        * Each input file will generate its own unmatched_* file (if applicable).

    **Example**:

    ::

        data/
            Data1.fasta:
                @SampleA:001
                AAAAAAAAAAAA
                @SampleAA:002
                AAAAAAAAAAAT
                @SampleA1:003
                AAAAAAAAAAAC
                @Sample_B:001
                AAAAAAAAAAAG

            Data2.fasta:
                @SampleAA:001
                GAAAAAAAAAAA
                @SampleA:002
                TAAAAAAAAAAA
                @Seq8
                CAAAAAAAAAAA

        ./
            Data.barcodes:
                SampleA     AAA
                SampleAA    AAA
                Sample_B    AAA

    ``$ python chewbacca.py demux_names -i data/ -b Data.barcodes -o rslt``

    Here, we see that Data1.fasta was assigned '0' as an offset, while Data2.fasta was assigned '1' as an offset.  \
    Because both files had sequences from SampleA, the sequences from Data1.fasta were written to \
    SampleA_0_demux.fastq, and those from Data2.fasta were written to SampleA_1_demux.fastq.  The same is true for \
    SampleAA.

    ::

        rslt/
            SampleA_0_demux.fastq:
                @SampleA:001
                AAAAAAAAAAAA
                @SampleA1:003
                AAAAAAAAAAAC

            SampleAA_0_demux.fastq:
                @SampleAA:002
                AAAAAAAAAAAT

            Sample_B_0_demux.fastq:
                @Sample_B:001
                AAAAAAAAAAAG

            SampleA_1_demux.fastq:
                @SampleA:002
                TAAAAAAAAAAA

            SampleAA_1_demux.fastq:
                @SampleAA:001
                GAAAAAAAAAAA

        rslt_aux/
            unmatched_1_demux.fastq:
                @Seq8
                CAAAAAAAAAAA
    """
    supported_programs = [Demux_Program_Chewbacca]
    default_program = Demux_Program_Chewbacca
    command_name = "Demux"
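
The name-based demux described in the docstring can be illustrated with a small standalone sketch. This is not Chewbacca's implementation: the helper names (`parse_barcode_names`, `assign_sample`, `demux_by_name`) are hypothetical, and the longest-prefix matching rule is an assumption inferred from the docstring example, where `@SampleA1:003` lands in the SampleA file and `@SampleAA:002` lands in the SampleAA file. Records are held in memory as (name, sequence) pairs instead of being read from and written to files.

```python
from collections import defaultdict


def parse_barcode_names(lines):
    """Return the sample names from .barcodes-style lines (name first, barcode second)."""
    names = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            names.append(fields[0])
    return names


def assign_sample(seq_name, sample_names):
    """Return the longest sample name that is a prefix of seq_name, or None.

    Longest-prefix matching is an assumption made for this sketch.
    """
    matches = [s for s in sample_names if seq_name.startswith(s)]
    return max(matches, key=len) if matches else None


def demux_by_name(records, sample_names, offset, ext="fastq"):
    """Group (name, sequence) records under <sample>_<offset>_demux.<ext> keys."""
    out = defaultdict(list)
    for name, seq in records:
        sample = assign_sample(name, sample_names)
        label = sample if sample is not None else "unmatched"
        out["{}_{}_demux.{}".format(label, offset, ext)].append((name, seq))
    return dict(out)


samples = parse_barcode_names(["SampleA AAA", "SampleAA AAA", "Sample_B AAA"])
data1 = [
    ("SampleA:001", "AAAAAAAAAAAA"),
    ("SampleAA:002", "AAAAAAAAAAAT"),
    ("SampleA1:003", "AAAAAAAAAAAC"),
    ("Sample_B:001", "AAAAAAAAAAAG"),
]
data2 = [
    ("SampleAA:001", "GAAAAAAAAAAA"),
    ("SampleA:002", "TAAAAAAAAAAA"),
    ("Seq8", "CAAAAAAAAAAA"),
]
result0 = demux_by_name(data1, samples, offset=0)
result1 = demux_by_name(data2, samples, offset=1)
```

Each input file is processed with its own offset, so sequences from the same sample in different input files end up under different output names, and each file gets its own `unmatched_*` group only when it has unmatched sequences.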