
Generic MS Parsers

Overview

Our MS readers, both format-specific and generic, parse mass spectrometry files in several formats, including mgf, mzxml and mzml.

The main advantage of these parsers is that they handle very large data files through iterators and filters.

Batches of spectra can also be processed on the fly, within the limits of available memory.

Specific Parsing

We provide a few format-specific MS parsers that can be created explicitly, each parsing one file format (dta, mgf, mzxml, ...):

import java.util.Iterator;

import org.expasy.jpl.io.ms.reader.MZXMLReader;
import org.expasy.jpl.io.ms.MassSpectrum;

// build the specific parser
MZXMLReader reader = MZXMLReader.newInstance();

// parse the mzxml file
reader.parse(mzxmlFile);

// get an iterator over the spectra
Iterator<MassSpectrum> it = reader.iterator();
	
while (it.hasNext()) {
	// get next spectrum
	it.next();
}
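
To illustrate why an iterator-based reader scales to large files, here is a minimal sketch that does not depend on JPL. It is not the library's implementation: the class name MgfBlockIterator is hypothetical, and it yields the raw text lines of each "BEGIN IONS" ... "END IONS" block of an MGF stream rather than MassSpectrum objects. Only one block is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch (not part of JPL): lazily iterates over the
// "BEGIN IONS" ... "END IONS" blocks of an MGF stream.
class MgfBlockIterator implements Iterator<List<String>> {
	private final BufferedReader in;
	private List<String> next; // block read ahead, or null when exhausted

	MgfBlockIterator(Reader source) {
		this.in = new BufferedReader(source);
		this.next = readBlock();
	}

	// Reads lines up to the next "END IONS". Returns null at end of
	// input; a block without a closing "END IONS" is dropped.
	private List<String> readBlock() {
		try {
			List<String> block = null;
			String line;
			while ((line = in.readLine()) != null) {
				line = line.trim();
				if (line.equals("BEGIN IONS")) {
					block = new ArrayList<>();
				} else if (line.equals("END IONS") && block != null) {
					return block;
				} else if (block != null && !line.isEmpty()) {
					block.add(line);
				}
			}
			return null;
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	@Override
	public boolean hasNext() {
		return next != null;
	}

	@Override
	public List<String> next() {
		if (next == null) {
			throw new NoSuchElementException();
		}
		List<String> current = next;
		next = readBlock();
		return current;
	}
}
```

The iterator reads one block ahead, so hasNext() never touches the stream and memory use stays proportional to the largest single spectrum rather than to the file size.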

Generic Parsing

We also provide a more generic way to parse MS spectra:

import org.expasy.jpl.io.ms.reader.MSReaderFacade;

// build a new generic MS parser
MSReader reader = MSReaderFacade.newInstance();
      
reader.parse(file);

// get information about the run
reader.getExperimentInfos();

// get an iterator over the spectra
Iterator<MassSpectrum> it = reader.iterator();

while (it.hasNext()) {
	// get next spectrum
	it.next();
}

You can also restrict the file formats that this generic parser accepts:

// this reader can only read MS spectral library files (msp, sptxt)
reader = MSReaderFacade.withExtension(Pattern.compile("msp|sptxt",
	 Pattern.CASE_INSENSITIVE));

// the following call throws a ParseException because the mgf format
// is not readable by this parser
reader.parse("test.mgf");
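
The restriction above can be pictured as a check on the file extension before any parsing starts. The sketch below is an assumption about the general technique, not the code behind MSReaderFacade.withExtension; the class ExtensionGuard and its methods are hypothetical names.

```java
import java.text.ParseException;
import java.util.regex.Pattern;

// Hypothetical sketch (not part of JPL): rejects files whose extension
// does not match the accepted pattern.
class ExtensionGuard {
	private final Pattern accepted;

	ExtensionGuard(Pattern accepted) {
		this.accepted = accepted;
	}

	// Returns the text after the last dot, or "" when there is none.
	static String extensionOf(String filename) {
		int dot = filename.lastIndexOf('.');
		return dot < 0 ? "" : filename.substring(dot + 1);
	}

	// Throws ParseException when the whole extension is not accepted.
	void check(String filename) throws ParseException {
		String ext = extensionOf(filename);
		if (!accepted.matcher(ext).matches()) {
			throw new ParseException("unsupported format: " + filename, 0);
		}
	}
}
```

With the pattern "msp|sptxt" compiled case-insensitively, check("library.MSP") passes and check("test.mgf") throws.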

Parsing With Filters

Filters (implementations of Condition) can be defined and attached to any MS parser to select which spectra to keep:

import org.expasy.jpl.core.util.condition.Condition;
import org.expasy.jpl.core.util.condition.ConditionImpl;
import org.expasy.jpl.core.util.condition.operator.OperatorLowerThan;

...
      
// build a filter on the MS level of a spectrum
Condition<MassSpectrum> msLevelFilter(int level) {
	// accessor that extracts the MS level from a spectrum
	Transformer<MassSpectrum, Integer> sp2level =
		new Transformer<MassSpectrum, Integer>() {

			public Integer process(MassSpectrum sp) {
				return sp.getPeakList().getMsLevel();
			}
		};

	// the condition compares the extracted MS level with the given level
	return new ConditionImpl.Builder<MassSpectrum, Integer>(level)
		.accessor(sp2level).build();
}

...
	
reader.setFilter(msLevelFilter(1));
	
reader.parse(file);
	
Iterator<MassSpectrum> it = reader.iterator();

while (it.hasNext()) {
	// get next MS1 spectrum
	it.next();
}

See the conditions in jpl-commons for more information.
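
Conceptually, coupling a filter to a parser amounts to wrapping its iterator so that rejected spectra are skipped. The following sketch shows that idea with the standard java.util.function.Predicate instead of the JPL Condition type; FilteringIterator is a hypothetical class, not part of the library.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical sketch (not part of JPL): wraps any iterator and only
// returns the elements accepted by the predicate.
class FilteringIterator<T> implements Iterator<T> {
	private final Iterator<T> source;
	private final Predicate<? super T> filter;
	private T pending;          // next accepted element, if any
	private boolean hasPending; // true when pending holds an element

	FilteringIterator(Iterator<T> source, Predicate<? super T> filter) {
		this.source = source;
		this.filter = filter;
	}

	@Override
	public boolean hasNext() {
		// advance the source until an element passes the filter
		while (!hasPending && source.hasNext()) {
			T candidate = source.next();
			if (filter.test(candidate)) {
				pending = candidate;
				hasPending = true;
			}
		}
		return hasPending;
	}

	@Override
	public T next() {
		if (!hasNext()) {
			throw new NoSuchElementException();
		}
		hasPending = false;
		T result = pending;
		pending = null;
		return result;
	}
}
```

Because rejected elements are discarded as soon as they are read, filtering does not increase memory use on large files.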

Progress Bar Ready

Any progress bar that implements JPLProgressBar can be attached to any MS parser:

// a generic parser
MSReader parser = MSReaderFacade.newInstance();
		
// new terminal progress bar
TerminalProgressBar pb =
	TerminalProgressBar.indeterminate();

// set the progress bar length
pb.setBarLength(20);

// set the roaming segment length
pb.setSegmentLength(12);

parser.setProgressBar(pb);

parser.parse(new File(paramManager.getFilename()));

Iterator<MassSpectrum> it = parser.iterator();
		
while (it.hasNext()) {
	// each entry parsed internally
	// increments the step in the progress bar
	it.next();
}

//  0   [============        ]
// ..   [       ============ ]
// 1000 [    ============    ]
// 5949 [      ============  ]
// ..
// 5979 [       ============ ]
// ..
// 9208 [====================]
// task finished

Note
TerminalProgressBar and JProgressBarAdapter both implement ProgressBar.
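
For readers curious about how a roaming segment like the one in the output above can be drawn, here is a minimal sketch. It assumes the segment bounces between both ends of the bar, which may differ from what TerminalProgressBar actually does; RoamingBar and render are hypothetical names, and the step is assumed to be non-negative with a segment no longer than the bar.

```java
// Hypothetical sketch (not part of JPL): renders one frame of an
// indeterminate bar whose segment bounces between both ends.
class RoamingBar {
	static String render(int step, int barLength, int segmentLength) {
		// number of cells the segment can shift to the right
		int range = barLength - segmentLength;
		int offset = 0;
		if (range > 0) {
			// move right for 'range' steps, then back left
			int phase = step % (2 * range);
			offset = phase <= range ? phase : 2 * range - phase;
		}
		StringBuilder sb = new StringBuilder("[");
		for (int i = 0; i < barLength; i++) {
			boolean inSegment = i >= offset && i < offset + segmentLength;
			sb.append(inSegment ? '=' : ' ');
		}
		return sb.append(']').toString();
	}
}
```

With a bar length of 20 and a segment length of 12, step 0 draws the segment flush left and step 8 draws it flush right.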

Block Parsing

Another parsing mode is available in the new version. It returns spectra in blocks rather than one at a time:

reader.parse(file);

Iterator<MassSpectrum> it = reader.iterator();

// the maximum number of spectra in a block (depends on
// allocated memory)
reader.getNextToListMaxSize();

// get the next block
List<MassSpectrum> l = it.nextToList(245);

while (it.hasNext()) {
	// get the next bunch of spectra
	l = it.nextToList();
}
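
The block mode can be understood as a thin layer over a regular iterator. The sketch below is independent of JPL and only mirrors the nextToList naming used above; BlockIterator is a hypothetical class, and the fixed maxBlockSize stands in for the limit that the readers estimate from memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch (not part of JPL): adds block retrieval on top
// of any iterator, never returning more than maxBlockSize elements.
class BlockIterator<T> implements Iterator<T> {
	private final Iterator<T> source;
	private final int maxBlockSize;

	BlockIterator(Iterator<T> source, int maxBlockSize) {
		if (maxBlockSize < 1) {
			throw new IllegalArgumentException("maxBlockSize must be >= 1");
		}
		this.source = source;
		this.maxBlockSize = maxBlockSize;
	}

	@Override
	public boolean hasNext() {
		return source.hasNext();
	}

	@Override
	public T next() {
		return source.next();
	}

	// Returns the next block of at most maxBlockSize elements.
	List<T> nextToList() {
		return nextToList(maxBlockSize);
	}

	// Returns at most min(size, maxBlockSize) elements; the list is
	// empty when the source is exhausted or size is not positive.
	List<T> nextToList(int size) {
		int limit = Math.min(size, maxBlockSize);
		List<T> block = new ArrayList<>();
		while (block.size() < limit && source.hasNext()) {
			block.add(source.next());
		}
		return block;
	}
}
```

Capping the requested size at maxBlockSize is the design choice that keeps a caller from accidentally loading a whole file into one list.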
      

Calls to nextToList() can cause memory problems on large files. The MS readers address this by automatically estimating and capping the size of each returned list.

Users who want to tune the block parameters can also control memory management directly:

// set the maximum fraction of total free memory dedicated to
// nextToList blocks, for all MS readers (optional)
reader.setMaxMemoryRatio(0.5);

// set the max memory in bytes for a block (100 KB)
reader.setNextToListMemoryLimit(100000);

reader.parse(file);

Iterator<MassSpectrum> it = reader.iterator();

// get the next block
List<MassSpectrum> l = it.nextToList();

// compare with the maximum number of spectra in a block, which is
// equivalent to a size of 100 KB
assertTrue(l.size() <= reader.getNextToListMaxSize());

Note
The memory test is enabled by default; it can be disabled by calling enableMemoryTest(false).
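
As a rough illustration of how a memory ratio and a byte limit could translate into a maximum block size, consider the arithmetic below. This is an assumption for explanatory purposes, not the estimation the readers actually perform: BlockBudget, its methods and the per-spectrum size are hypothetical.

```java
// Hypothetical sketch (not part of JPL): derives the maximum number
// of spectra per block from a memory budget.
class BlockBudget {
	// Budget in bytes: the smaller of the explicit limit and the
	// allowed fraction of free memory.
	static long budgetBytes(long freeBytes, double maxRatio, long limitBytes) {
		long byRatio = (long) (freeBytes * maxRatio);
		return Math.min(byRatio, limitBytes);
	}

	// Number of spectra that fit in the budget, at least 1 so that
	// parsing can always make progress.
	static int maxBlockSize(long budgetBytes, long bytesPerSpectrum) {
		if (bytesPerSpectrum <= 0) {
			throw new IllegalArgumentException("bytesPerSpectrum must be > 0");
		}
		long n = budgetBytes / bytesPerSpectrum;
		return (int) Math.max(1, Math.min(n, Integer.MAX_VALUE));
	}
}
```

For example, with 1,000,000 free bytes, a ratio of 0.5 and a limit of 100,000 bytes, the budget is 100,000 bytes; at an assumed 400 bytes per spectrum this gives a block of 250 spectra. In a running JVM the free memory could be approximated from Runtime.getRuntime() as maxMemory() - totalMemory() + freeMemory().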