
regex for parsing dimension measurement descriptions (Python)

Essentially, for rows whose work_height, work_width, and work_depth values are missing (stored as -1.0) but whose work_dimensions column contains a text description of those dimensions, I want to parse that description into the work_height, work_width, and work_depth columns. Based on my exploration, the descriptions come in a few formats:

  • __ unit x __ unit x __ unit, e.g. 200 x 300 mm. This one should be easy.
  • __ unit x __ unit \newline __ unit x __ unit, e.g. 200 x 300 mm\n400 x 760 mm. I believe these are two different dimension settings for the same image. I want to create a new image item (row) for the second setting (and for a third, and so on).
  • Written-out mixed fractions, e.g. 16 7/8 in (42.8 cm) or 16 7/8in (42.8cm). How should these be parsed? This is one of the hard ones. Since the unit column work_measurement_unit is generally mm, I presume the metric value in parentheses is the one to parse (and even then I have to convert from cm to mm).
  • A measurement label followed by the mixed fraction and the other unit in parentheses, as above, e.g. Diameter: 19 3/7 in (72.5 cm).
To select the affected rows, I used:

[CODE lang="python" title="code to get missing data"]dims = ['work_height', 'work_width', 'work_depth']
# Description present, but all three numeric dimensions missing (-1.0)
mask = df['work_dimensions'].notna() & (df['work_dimensions'] != '-1') & (df[dims] == -1.0).all(axis=1)
df.loc[mask, ['work_dimensions', *dims, 'work_measurement_unit']][/CODE]
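
For the multi-line case, one possible starting point needs no regex at all: split each description on the newline and let pandas give every line its own row. This is only a sketch on made-up data. It assumes the separator is a real newline character rather than a literal backslash-n, and work_id is a hypothetical identifier column added just to show which rows came from the same item.

```python
import pandas as pd

# Toy stand-in for the real data: column names follow the post,
# the values and the work_id column are made up for illustration.
df = pd.DataFrame({
    "work_id": [1, 2],
    "work_dimensions": ["200 x 300 mm\n400 x 760 mm", "16 7/8 in (42.8 cm)"],
    "work_height": [-1.0, -1.0],
    "work_width": [-1.0, -1.0],
    "work_depth": [-1.0, -1.0],
})

dims = ["work_height", "work_width", "work_depth"]
mask = (
    df["work_dimensions"].notna()
    & (df["work_dimensions"] != "-1")
    & (df[dims] == -1.0).all(axis=1)
)

# Turn each description into a list of lines, then give every line its own row.
# The other columns (including work_id) are repeated for each new row.
expanded = (
    df.loc[mask]
    .assign(work_dimensions=lambda d: d["work_dimensions"].str.split("\n"))
    .explode("work_dimensions")
    .reset_index(drop=True)
)

print(expanded[["work_id", "work_dimensions"]])
```

After this step each row holds a single-line description, so whatever parser ends up being used only has to deal with one measurement setting at a time.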

I'm not too familiar with regex, in Python or in general, so any help would be appreciated!
 
There are two kinds of developer: those that know regex (Perl?) and them that don't.
It's a special area indeed.
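
That said, the four shapes listed in the question look tractable with one pattern plus a little arithmetic. Below is a rough sketch, not a tested solution for the real data. It assumes the only units are mm, cm, m and in, that "x" is the only separator, and that when a line has both an inch value and a metric value the metric one is the one to keep. parse_dimensions, to_number, NUMBER, UNIT, GROUP and TO_MM are names made up for this example. Fractions are used for the arithmetic so that 16 7/8 and 42.8 convert without floating-point surprises.

```python
import re
from fractions import Fraction

# Conversion factors to millimetres (assumed unit set).
TO_MM = {"mm": Fraction(1), "cm": Fraction(10), "m": Fraction(1000), "in": Fraction("25.4")}

# A number: mixed fraction "16 7/8", plain fraction "7/8", or decimal "42.8" / "200".
NUMBER = r"\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?"
UNIT = r"mm|cm|in|m"

# One measurement group: numbers joined by "x", then a single unit,
# e.g. "200 x 300 mm" or "16 7/8in".
GROUP = re.compile(
    rf"((?:{NUMBER})(?:\s*x\s*(?:{NUMBER}))*)\s*({UNIT})\b",
    re.IGNORECASE,
)


def to_number(token):
    """Convert '200', '42.8', '7/8' or '16 7/8' to an exact Fraction."""
    return sum(Fraction(part) for part in token.split())


def parse_dimensions(text):
    """Return one (label, [values in mm]) tuple per line that contains a measurement."""
    results = []
    for line in text.splitlines():
        # Optional leading label such as "Diameter:".
        label_match = re.match(r"\s*([A-Za-z][A-Za-z ]*):", line)
        label = label_match.group(1).strip() if label_match else None

        metric, imperial = [], []
        for numbers, unit in GROUP.findall(line):
            unit = unit.lower()
            tokens = re.split(r"\s*x\s*", numbers, flags=re.IGNORECASE)
            values = [float(to_number(tok) * TO_MM[unit]) for tok in tokens]
            (imperial if unit == "in" else metric).extend(values)

        # Prefer the metric figures; fall back to converted inches.
        chosen = metric or imperial
        if chosen:
            results.append((label, chosen))
    return results


print(parse_dimensions("200 x 300 mm\n400 x 760 mm"))
print(parse_dimensions("16 7/8in (42.8cm)"))
print(parse_dimensions("Diameter: 19 3/7 in (72.5 cm)"))
```

Which of the returned values is height, width or depth is not something the pattern can decide; that mapping depends on the convention used in the data set and is left open here.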
 