I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using `StringIndexer` and `OneHotEncoder`, then using `VectorAssembler` to combine it with a continuous independent variable into a column of sparse vectors.

If my column names are `continuous` and `categorical`, where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:

```
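from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Map each category string to a numeric index
string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')
# Turn the numeric index into a sparse dummy-variable vector
encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')
# Combine the continuous column and the dummy vector into one feature vector
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')
# stages must be a list of pipeline stages
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)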
```

Everything works fine to this point, and I run the model:

```
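from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
# Fitted coefficients, in the same order as the entries of 'indep_vars'
model.coefficients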
```

Which outputs:

DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])

Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to **link these coefficients to the original column names**, which I need to do (I've simplified this model for SO; there's more involved.)

The relationship between column names and coefficients is broken by `StringIndexer` and `OneHotEncoder`. I've found one fairly slow way:

```
df[['categorical', 'categorical_index']].distinct()
```

Which gives me a small dataframe relating the string names to the numerical indices, which I think I could then relate back to the keys in the sparse vector? This is very clunky and slow though, when you consider the scale of the data.
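To make the mapping I have in mind concrete, here is a rough sketch in plain Python (no Spark involved). It assumes the index-to-label pairs have already been collected from the small dataframe above, that `OneHotEncoder` drops the last category (its default, `dropLast=True`), and that `VectorAssembler` lays out the features in the order of `inputCols`. The function name `name_coefficients` and the labels `cat_0` ... `cat_7` are hypothetical placeholders, not my real data.

```python
def name_coefficients(coefficients, index_to_label, continuous_cols, drop_last=True):
    """Pair each coefficient with a readable feature name.

    Assumed layout of the assembled vector: the continuous columns first,
    then one slot per category index (0, 1, 2, ...), with the highest
    index omitted when drop_last is True.
    """
    n_categories = len(index_to_label)
    n_slots = n_categories - 1 if drop_last else n_categories

    names = list(continuous_cols)
    for idx in range(n_slots):
        names.append('categorical_' + index_to_label[idx])

    if len(names) != len(coefficients):
        raise ValueError('expected %d coefficients, got %d'
                         % (len(names), len(coefficients)))
    return list(zip(names, coefficients))


# Hypothetical index -> label mapping for the 8 categories
index_to_label = {i: 'cat_%d' % i for i in range(8)}

# Coefficients copied from the model output above
coefficients = [8440.0573, 3729.449, 4388.9042, 2879.1802,
                4613.7646, 5163.3233, 5186.6189, 5513.1392]

named = name_coefficients(coefficients, index_to_label, ['continuous'])
for name, value in named:
    print(name, value)
```

With 8 categories and the last one dropped, this accounts for the 8 coefficients: one for `continuous` plus seven dummies. It still depends on collecting the distinct pairs first, which is the slow part.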

Is there a better way to do this?