Plotting scipy sparse matrices with matplotlib

Downloading the sample data

import pyodide
import os
import tarfile

URL = "https://data.arakaki.tokyo/static/blog/20220301/"
FILES = [
    "bcsstm22.tar.gz",
    "lns_3937.tar.gz",
    "orani678.tar.gz",
    "memplus.tar.gz",
]

paths = list()
for file in FILES:
    # Download the sparse matrix data (tar.gz)
    with open(file, "w") as f:
        f.write(pyodide.open_url(f'{URL}{file}').read())
    # Extract the tar file
    tar = tarfile.open(file)
    for path in tar.getnames():
        paths.append(path)
        tar.extract(path)
        print(path)

import matplotlib.pyplot as plt
import scipy.io as scio

Reading Matrix Market files

First, read the Matrix Market (MM) files with scipy.

Input and output (scipy.io) | Matrix Market files
- scipy.io.mminfo
- scipy.io.mmread

scipy has several sparse matrix classes, but MM files appear to represent sparse matrices in coordinate (COO) format, so they are read in as coo_matrix objects.

# Read the MM files and collect them in data
data = dict()
for p in paths:
    m = scio.mmread(p)
    # Only handle square matrices
    if m.shape[0] == m.shape[1]:
        name, ext = os.path.splitext(os.path.basename(p))
        data[name] = m
data
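To make the COO point concrete without the downloaded files, here is a minimal sketch that writes a tiny matrix to a temporary Matrix Market file and reads it back. The file name tiny.mtx and the matrix values are made up for illustration; mmwrite, mminfo and mmread are the scipy.io functions.

```python
import os
import tempfile

import numpy as np
import scipy.io as scio
from scipy.sparse import coo_matrix

# A made-up 3x4 matrix with three nonzero entries (illustration only).
row = np.array([0, 1, 2])
col = np.array([3, 0, 2])
val = np.array([1.5, -2.0, 4.0])
tiny = coo_matrix((val, (row, col)), shape=(3, 4))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.mtx")  # hypothetical file name
    scio.mmwrite(path, tiny)
    # mminfo reports (rows, cols, entries, format, field, symmetry)
    info = scio.mminfo(path)
    m = scio.mmread(path)

print(info)
print(type(m).__name__, m.format)
# COO keeps three parallel arrays: row indices, column indices, values
print(m.row, m.col, m.data)
```

The row, col and data attributes printed at the end are the three arrays a COO matrix is made of.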

Handle sparse matrices with care❗

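Naturally, the first thing we want to do is see what these sparse matrices look like, but plotting them with matplotlib in the usual way runs into the following problems.

- The plot ends up at a scale that cannot be read
- Naively converting to an ndarray eats up all available memory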

Scale

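Let's look at bcsstm22 and orani678 as examples.
bcsstm22, whose height and width are on the order of 100, is drawn reasonably well, but orani678, which exceeds 2500, ends up filled in with a single color.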

fig, axs = plt.subplots(1, 2, figsize=[9, 3.5])
for ax, k in zip(axs, ["bcsstm22", "orani678"]):
    ax.set_title(k)
    im = ax.imshow(data[k].toarray(), cmap="Blues")
    fig.colorbar(im, ax=ax)
fig.tight_layout()
plt.show()
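One rough way to see why the larger matrix becomes unreadable is to compare how many cells are filled in and how many matrix columns have to share one screen pixel. The sketch below is only an estimate under stated assumptions: the helper names density and cells_per_pixel, the 2500x2500 diagonal matrix, and the 4.5 inch / 100 dpi axes size are all hypothetical and not taken from the sample data.

```python
import numpy as np
from scipy.sparse import coo_matrix


def density(m):
    """Fraction of cells that hold a stored entry."""
    return m.nnz / (m.shape[0] * m.shape[1])


def cells_per_pixel(n_cols, axes_width_in, dpi=100):
    """Rough number of matrix columns squeezed into one screen pixel."""
    return n_cols / (axes_width_in * dpi)


# Hypothetical 2500x2500 matrix with only its diagonal filled in.
n = 2500
idx = np.arange(n)
m = coo_matrix((np.ones(n), (idx, idx)), shape=(n, n))

print(f"density: {density(m):.4%}")
# Assumes the axes are about 4.5 inches wide at 100 dpi.
print(f"columns per pixel: {cells_per_pixel(n, 4.5):.1f}")
```

When several columns land on one pixel and almost every cell is zero, the few nonzero cells are easily lost in an image-style plot.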

Memory usage

We are already stuck at this point, but to drive the point home, let's also check how much memory the matrices would use as ndarrays.

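def size(v):
    'Format a byte count in human-readable form'
    units = ["B", "KB", "MB", "GB"]
    for u in units[:-1]:
        if v < 1024:
            return f'{v:5.1f} {u:2}'
        v /= 1024
    # Anything larger is reported in the last unit instead of returning None
    return f'{v:5.1f} {units[-1]:2}'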

Memory occupied by the sparse (COO) instances

for k, v in data.items():
    print(f'{k:10}: {size(v.data.nbytes + v.row.nbytes + v.col.nbytes)}')

The equivalent size as an ndarray

for k, v in data.items():
    print(f'{k:10}: {size(v.dtype.itemsize * v.shape[0] * v.shape[1])}')

In the browser, merely trying to convert memplus to an ndarray raises an error.
Even in a regular Python environment, handling it as an ndarray is an uphill battle.

# Error
data["memplus"].toarray()
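If a dense array is occasionally needed anyway, one option is to estimate its size first and refuse to convert when it is too large. The following is a minimal sketch, not part of scipy: dense_nbytes, toarray_checked and the 512 MiB default limit are hypothetical names and values chosen for illustration, and the matrices are synthetic.

```python
import numpy as np
from scipy.sparse import coo_matrix


def dense_nbytes(m):
    """Bytes the matrix would occupy as a dense ndarray."""
    return m.dtype.itemsize * m.shape[0] * m.shape[1]


def toarray_checked(m, limit_bytes=512 * 1024**2):
    """Densify only if the estimated size stays under limit_bytes (assumed limit)."""
    need = dense_nbytes(m)
    if need > limit_bytes:
        raise MemoryError(f"dense form needs {need} bytes, limit is {limit_bytes}")
    return m.toarray()


# Synthetic matrices: a 20000x20000 one with two entries, and a 2x2 one.
big = coo_matrix(
    (np.array([1.0, 2.0]), (np.array([0, 19999]), np.array([5, 19999]))),
    shape=(20000, 20000),
)
small = coo_matrix((np.array([1.0]), (np.array([0]), np.array([1]))), shape=(2, 2))

print(dense_nbytes(big))
try:
    toarray_checked(big)
except MemoryError as e:
    print("skipped:", e)
print(toarray_checked(small))
```

The check costs nothing because it only looks at the shape and dtype, never at the stored entries.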

Solution: use `spy`

If you only want to check the shape, pyplot.spy does it in one shot.
markersize controls the granularity.

fig = plt.figure(figsize=[10, 10])
for i, (k, m) in enumerate(data.items()):
    ax = fig.add_subplot(2, 2, i+1)
    ax.spy(m, markersize=1)
    ax.set_title(k)
fig.tight_layout()
plt.show()

To inspect the structure in more detail, make figsize larger and markersize smaller.

fig, ax = plt.subplots(figsize=[10, 10])
ax.spy(data["orani678"], markersize=.1)
plt.show()
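For readers without the sample files, the same call can be tried on a synthetic matrix. This is a sketch under assumptions: the 2000x2000 size, the 5000 random entries and the Agg backend line are choices made here so the example runs headless, not something from the data above. With a sparse input and a markersize, spy draws one marker per stored nonzero entry, which the printout lets you check.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.sparse import coo_matrix

# Synthetic 2000x2000 matrix with 5000 distinct nonzero positions.
rng = np.random.default_rng(0)
n, k = 2000, 5000
flat = rng.choice(n * n, size=k, replace=False)
row, col = np.divmod(flat, n)
m = coo_matrix((np.ones(k), (row, col)), shape=(n, n))

fig, ax = plt.subplots(figsize=[5, 5])
line = ax.spy(m, markersize=0.5)

# Compare the number of plotted markers with the number of stored entries.
print(type(line).__name__, len(line.get_xdata()), m.nnz)

# Render to an in-memory PNG instead of opening a window.
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

In a notebook, remove the matplotlib.use line and call plt.show() instead of saving to the buffer.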

That's all❗
