
Parsing binary files with regular expressions

important script information
company name:
code.activestate.com
license: Free
minimum requirements: Python
functional limitations:
Parsing binary files with regular expressions description


This script lets you use the regular expression engine to parse binary files, especially those for which the struct module alone is inadequate.

The typical way to parse binary data in Python is to use the unpack function of the struct module. This works well for fixed-width fields, but becomes more complicated when you need to parse variable-width fields. Perl's implementation of unpack accepts "*" as the field length, and even allows grouping with parentheses, which mitigates this problem. Python does not currently offer these features. Although you can dynamically generate a format string for unpack with a lot of slicing and calls to calcsize, the resulting code will likely be hard to read and error-prone.

Fortunately, in some cases there is a simpler way: use the regular expression engine to grab each field, and call struct.unpack on the results.

First, construct a regular expression (RE) describing the entire record structure, grouping each field you'd like to extract with parentheses, and compile it. To create the regular expression, just remember that one character in the RE equals one byte in the record. So the expression ".." matches any short (2 bytes). To match a variable-width field, the RE engine has to be able to recognize where the field ends. In a null-terminated string, for example, the field ends with a zero byte. You'd therefore look for any number of characters followed by a null byte: "(.*?)\0". Note the use of the non-greedy qualifier "?": this way, the match stops at the first null rather than the last null in the buffer.

When compiling, make sure to pass the re.DOTALL flag, or "." will refuse to match any byte that happens to equal ASCII '\n', because it treats that byte as a newline. Then use the findall method of the compiled expression object on your buffer. findall finds all non-overlapping matches, one match for each record.
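As a minimal sketch of the pattern-building step, assuming Python 3 (where the buffer and the pattern are both bytes objects): the two-field record layout, the names RECORD_RE and build_sample, and the sample data are all hypothetical and chosen only to show what re.DOTALL changes.

```python
import re
import struct

# Hypothetical layout: a 2-byte field followed by a null-terminated string.
# One pattern character corresponds to one byte of the record.
RECORD_RE = re.compile(rb"(..)(.*?)\0", re.DOTALL)


def build_sample():
    """Build two records by hand; both contain the byte 0x0A ('\\n')."""
    first = struct.pack("<H", 10) + b"alpha\0"    # the short 10 is b"\n\x00"
    second = struct.pack("<H", 2) + b"be\nta\0"   # newline inside the string
    return first + second


# One tuple per record, one element per parenthesized group.
raw_matches = RECORD_RE.findall(build_sample())

# The same pattern without DOTALL: "." skips over 0x0A bytes, so the
# matches drift out of alignment with the real record boundaries.
misaligned = re.compile(rb"(..)(.*?)\0").findall(build_sample())
```

Here raw_matches holds the raw bytes of each field, still packed; comparing it with misaligned shows why the flag matters for binary data.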
findall returns a list of tuples, one for each match; each tuple contains one element for each field you grouped in the RE.

You still need to unpack the fields in the tuples before using them, since they are still raw strings rather than usable values. Generally, you'll call unpack once for each field, with only one format character. (You can also group multiple consecutive fixed fields in one set of parentheses in the RE and unpack them in one call, but that may get confusing.)

The recipe's code demonstrates how to unpack a binary file that has an indeterminate number of variable-width records, each consisting of a little-endian short, a null-terminated string, and two more shorts. It drops the resulting values into a list and also into a dictionary.

This technique is useful when your variable-width fields are terminated with a sentinel, such as the zero-terminated strings described above. If your field length is embedded in the data, and you can't use the "p" (Pascal string) format character, you'll probably have to resort to slicing the buffer up manually.

The technique also applies when all your fields are fixed-width. The findall method operates on the entire buffer at once with a single regular expression, which saves you from having to dynamically create a long format string encapsulating all your data, or alternatively iterating over slices of the buffer.
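The record layout described above (a little-endian short, a null-terminated string, and two more shorts) could be parsed roughly as follows. This is a sketch under stated assumptions, not the original recipe's code: it assumes Python 3 bytes, ASCII-encoded names, and signed shorts, and the names parse_records, pack_record, and the field names are hypothetical.

```python
import re
import struct

# One record: little-endian short, null-terminated string, two more shorts.
RECORD_RE = re.compile(rb"(..)(.*?)\0(..)(..)", re.DOTALL)


def parse_records(buf):
    """Return (list of records, dict keyed by name) parsed from buf."""
    records = []
    by_name = {}
    for raw_id, raw_name, raw_x, raw_y in RECORD_RE.findall(buf):
        # Each group is still raw bytes; unpack one field per call.
        (rec_id,) = struct.unpack("<h", raw_id)
        (x,) = struct.unpack("<h", raw_x)
        (y,) = struct.unpack("<h", raw_y)
        name = raw_name.decode("ascii")  # assumption: names are ASCII
        record = (rec_id, name, x, y)
        records.append(record)
        by_name[name] = record
    return records, by_name


def pack_record(rec_id, name, x, y):
    """Hypothetical helper that builds one record for the demonstration."""
    return (struct.pack("<h", rec_id) + name.encode("ascii") + b"\0"
            + struct.pack("<hh", x, y))


sample = pack_record(1, "alpha", 10, -20) + pack_record(2, "be", 300, 4)
records, by_name = parse_records(sample)
```

Note that findall silently skips any bytes that do not fit the pattern, so a corrupted or truncated record is dropped rather than reported; checking that the matched spans cover the whole buffer would be one way to detect that.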



Relates:
Files - Binary - Parser - Files Management - Regular Expressions - Binary Files Parser
Similar scripts
Convert PDF to TIFF (Popularity: ) : This script is a very short code snippet illustrating how to convert individual pages of PDF documents to TIFF files, one TIFF file per page. It works only on Mac OS X with PyObjC installed. As a recipe the code is ...
Count PDF pages (Popularity: ) : Count PDF pages script is a simple way to count the pages of a PDF the pure Python way.
Counting pages of PDF documents on Mac OS X (Popularity: ) : Given that PDF is a "native" data format on Mac OS X, it is very easy to get access to some properties of such documents. One is the number of pages. Using Python the necessary code to do this is ...
Iterate over .MP4 atoms (Popularity: ) : This script yields the atoms contained in an MP4 file. It is mostly used for extracting the tags contained in it (artist, title, etc.) using a convenience class (M4ATags). This script could be implemented as a generator.
Disk (Popularity: ) : This script provides a simple simulation of secondary memory and is primarily designed to provide a driver interface to a virtual hard drive. The interface is simple and allows the simulation of IO errors. Also provided are methods that allow ...
Cross Platform Excel Parsing With Xlrd (Popularity: ) : This script easily extracts data from Microsoft Excel files using a wrapper class for xlrd. The class allows you to create a generator which returns Excel data one row at a time as either a list or dictionary. This script ...
A Singleton log file creator (Popularity: ) : This class is a basic Singleton log file creator. It allows separate classes/modules to log their activities to the same file (even the same line if they want to). This is a quite basic log file creator, intended to assist in ...
Backup your files (Popularity: ) : The Backup your files script makes backup versions of your files. It can also be used for non-Python source code.
A Python script to test download mirrors (Popularity: ) : The concept of the script is straightforward: read the mirrors page from RedHat's web site, make a list of all the mirrors, test how long it takes to download from each, and present a sorted list of the results.

The first ...

Similar Software
Batch RegEx (Popularity: ) : Replace, format, and extract text in multiple files using Regular Expressions. Perform GREP-like tasks including search and replace, RegEx substitutions, data extraction, and more! Built-in RegEx editors support color syntax highlighting and contextual tooltips making it easy to design patterns. ...
RegexBuddy (Popularity: ) : Perfect companion for working with regular expressions. Easily create regular expressions that match exactly what you want. Clearly understand complex regexes written by others. Quickly test any regex on sample strings and files, preventing mistakes on live data. Use the ...
TextCrawler (Popularity: ) : A tool for searching and replacing across multiple text files. Supports regular expressions and provides an expression tester and library facility. It also features an interactive file list and highlighted search results. Freeware.

Features:
* Fast searching, even on large ...

HexEditXP (Popularity: ) : HexEditXP is a professional hex and structure editor for editing binary files. Being a flexible and fast hex editor, it has a powerful built-in scripting engine which is used to run scripts that can parse binary files into hierarchical data ...
THE Rename (Popularity: ) : Rename files and folders, pictures with their width and height and EXIF tags. Rename MP3, VQF, OGG and WMA files. Possibility to export tags from music and picture files. Rename files with regular expressions. Works separately on either the prefix ...
SQLRegEx: SQL Server Regular Expressions (Popularity: ) : SQLRegEx adds regular expression capabilities to Microsoft SQL Server 2000. With regular expressions, a person skilled in Transact-SQL can perform a vast number of data manipulation tasks which previously would have required complicated code using T-SQL string functions, or a ...
MoveIt2 Lite (Popularity: ) : MoveIt2 is a utility to automatically move / copy / delete files that are added to a specific folder. You can set filters based on the file name, rename files (using regular expressions) when moving them to the new destination. ...
vvvSoft myDevStudio (Popularity: ) : Main Features:- Syntax Highlight&Folding:* C/C++/C#/Recouse/Java/JS* VB/VBS* HTML/XML/CSS*...- MDI editor with TabControl- Support for Projects- BuiltIn Explorer- Compile/Build/Go- Export to HTML- rich HTML+JS/ASP/JSP Highlight- Integration (WinCmd,FAR,Shell)- Find/Replace/in Files/with Regular Expressions- Line Numbers, White Space, Line Endings- Copy to Clipboard as HTML- ...
Multilizer Lite for Documents 2009_ (Popularity: ) : Multilizer Lite for Documents is an easy-to-use tool for localizing documents in the most common document formats. Multilizer Lite for Documents enables localization of typical text documents, such as HTML (.html, .htm, .php, .asp, etc.) including embedded scripts (JScript for ...
Finders Keepers (Popularity: ) : Search files of any kind, replace text, index files for instant searches, and launch files for viewing or editing. Search 4 ways: plain-text, regular expressions, approximate, and sound-alike. View, edit, or launch found files 6 ways, e.g., by associated files, ...
supported os
Windows, Linux, Mac OS, BSD, Solaris
stats
downloads 3
version 1.0
size in Kb
popularity   
858/377158
user rating 5/10