Log Parsing Details¶
This document explains the internal implementation of lammps-logfile’s parsing logic. It is useful for developers who want to contribute to the core reader or understand the performance characteristics.
Implementation Overview¶
The parsing logic (found in lammps_logfile.reader.read_log) uses a block-based approach to handle the variability of LAMMPS log files. Instead of assuming a fixed structure, it scans the file for specific “start” and “stop” keywords that delineate thermodynamic data blocks (the output of a run command).
Block Detection¶
The detection of a valid run block relies on recognizing start strings typical of LAMMPS output:
start_strings = [
"Memory usage per processor",
"Per MPI rank memory allocation"
]
When one of these strings is encountered, the parser enters a “capture mode”. It collects lines until a stop string is found:
stop_strings = [
"Loop time",
"ERROR",
"Fix halt condition"
]
This robustly isolates the thermodynamic data from other log output (e.g., potential energy initialization, neighbor list builds).
Handling Run Styles¶
LAMMPS allows for different thermo_style settings, which produce different output formats. lammps-logfile automatically detects and handles these.
thermo_style custom (Default)¶
In the custom style (or one), the data is preceded by a header line containing the column names (e.g., Step Temp Press).
Logic: The parser accumulates the block lines and passes them to
pandas.read_csvusing the C-based engine.Separator: Since LAMMPS uses whitespace,
sep=r'\s+'is used.
thermo_style multi¶
In the multi style, data is output across multiple lines per timestep, separated by a header:
------------ Step 0 ----- CPU = 0.0000 (sec) -------------
TotEng = -5.2737 KinEng = 1.4996
...
Detection: The parser checks the first 10 lines of a captured block. If a line starts with
------------ Step, it identifies the block asmultistyle.Parsing: A specialized function
_parse_multi_styleiterates through the lines. It extracts theStepandCPUfrom the separator line and parses the followingKey = Valuepairs into a dictionary for that timestep.
Performance Optimization¶
Parsing large text files requires careful optimization. lammps-logfile achieves >100 MB/s effective parsing speeds through:
Memory-Mapped Scanning: The library uses
mmapto scan the file for data blocks directly on disk, avoiding the need to load the entire text file into Python memory.Zero-Copy Slicing: Identified data blocks are passed as memory views directly to
pandas.read_csv, minimizing data copying.Pandas C-Engine: The heavy numeric parsing is handled by the optimized Pandas C-backend.
[!NOTE] Minimal Memory Footprint: Unlike naïve parsers that read the whole file (
readlines()), this implementation’s memory usage is primarily determined by the size of the output data, not the input text file.
Completeness & Accuracy¶
Multiple Runs: The parser maintains a
run_numcounter. Each successfully parsed block increments this counter and receives arun_numcolumn. The blocks are then concatenated into a single DataFrame.Missing Columns: If a log file switches
thermo_stylebetween runs (e.g., adding a new compute), the final DataFrame will contain the union of all columns. Timesteps where a column was not present will haveNaNvalues, ensuring no data is discarded.