The news that Google has open-sourced its new “AI”-driven file format identification tool, magika, made a splash in the usual tech places recently. This post takes a very quick look at just one file through the lens of three file format identification tools, and gestures at what we give up when we give in to big tech’s machine learning models.

Andy Jackson from the Digital Preservation Coalition has a good post about how magika is quite limited in terms of the formats it identifies and the types of information it reports. He also points out that it’s important to remember that Google created magika to help route files to specialized security scanners in Gmail and Google Drive, which is quite different from digital preservation use cases. In digital preservation the concern is usually with mitigating the perceived obsolescence of file formats, and with determining what applications can be used to render a file, both of which require knowledge not just of the format but also of its version.

So, here’s a quick comparison of one TIFF file as seen by magika, the venerable Unix file command, and siegfried, a specialized tool developed by and for the digital preservation community. Think of this as a close reading of file format identification tools, trying to discover or illustrate something significant in the details of their output, rather than a statistical overview of what the tools do more generally.

$ magika MCE_AF2G_2010.tif
MCE_AF2G_2010.tif: TIFF image data

Good job magika, it is a TIFF file.
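
The command line output is terse, but magika will also report a confidence score. Here’s a minimal sketch using the Python API as it looked in the version I installed; the attribute names may well differ in other releases:

from pathlib import Path
from magika import Magika

m = Magika()
result = m.identify_path(Path("MCE_AF2G_2010.tif"))

# the predicted content type label, and the model's confidence in it
print(result.output.ct_label, result.output.score)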

$ file MCE_AF2G_2010.tif
MCE_AF2G_2010.tif: TIFF image data, little-endian, direntries=18, height=3724, bps=230, compression=LZW, PhotometricIntepretation=RGB, width=7460

Awesome file, thanks for the extra information about the dimensions, compression, and colors.

$ sf MCE_AF2G_2010.tif
---
siegfried   : 1.11.0
scandate    : 2024-02-20T18:01:54-05:00
signature   : default.sig
created     : 2023-12-17T15:54:41+01:00
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V116.xml; container-signature-20231127.xml'
---
filename : 'MCE_AF2G_2010.tif'
filesize : 1484121
modified : 2024-02-08T10:07:46-05:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/155'
    format  : 'Geographic Tagged Image File Format (GeoTIFF)'
    version :
    mime    : 'image/tiff'
    class   : 'GIS, Image (Raster)'
    basis   : 'extension match tif; byte match at 0, 186 (signature 1/4)'
    warning :

Niiice siegfried, this is important! The TIFF file isn’t just any image file, it’s a GeoTIFF file. If we were to open the TIFF file in a regular image viewer like Preview on macOS we’d see this:

But since we know it’s a GeoTIFF we can also view it in GIS software like QGIS:

And we can use other tools like gdalinfo to look at metadata in the file:

➜  tmp gdalinfo x.tif
Driver: GTiff/GeoTIFF
Files: x.tif
Size is 7460, 3724
Coordinate System is:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Origin = (56.249996410171626,38.166360030600401)
Pixel Size = (0.002160683455290,-0.002160683455290)
Metadata:
  AREA_OR_POINT=Area
  DataType=Thematic
Image Structure Metadata:
  INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left  (  56.2499964,  38.1663600) ( 56d14'59.99"E, 38d 9'58.90"N)
Lower Left  (  56.2499964,  30.1199748) ( 56d14'59.99"E, 30d 7'11.91"N)
Upper Right (  72.3686950,  38.1663600) ( 72d22' 7.30"E, 38d 9'58.90"N)
Lower Right (  72.3686950,  30.1199748) ( 72d22' 7.30"E, 30d 7'11.91"N)
Center      (  64.3093457,  34.1431674) ( 64d18'33.64"E, 34d 8'35.40"N)
Band 1 Block=7460x1 Type=Byte, ColorInterp=Red
  Mask Flags: PER_DATASET ALPHA
Band 2 Block=7460x1 Type=Byte, ColorInterp=Green
  Mask Flags: PER_DATASET ALPHA
Band 3 Block=7460x1 Type=Byte, ColorInterp=Blue
  Mask Flags: PER_DATASET ALPHA
Band 4 Block=7460x1 Type=Byte, ColorInterp=Alpha
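
Or, to get at the georeferencing programmatically instead of eyeballing gdalinfo’s output, a minimal sketch with rasterio (a Python wrapper around GDAL) would look something like this, assuming it opens the file the same way gdalinfo just did:

import rasterio

# rasterio sits on top of GDAL, so it sees the same GeoTIFF metadata
with rasterio.open("MCE_AF2G_2010.tif") as src:
    print(src.crs)                            # EPSG:4326 (WGS 84)
    print(src.bounds)                         # the corner coordinates above
    print(src.width, src.height, src.count)   # 7460 x 3724 pixels, 4 bands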

So if we had only used magika we never would have known we could put the image on a map.

… dramatic pause …

But perhaps even more important is what we are giving up when we rely entirely on a machine learning model, like the one that ships with magika, instead of the hand-crafted rules used by file and siegfried.

  1. We lose the ability to reason about the output. Why was one format chosen and not another? (More on this just after this list.)
  2. We lose the ability to update the tool to recognize new formats, or to fix cases where it picks the wrong one.
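
To make the first point concrete: siegfried’s basis field above tells us exactly why it decided this was a GeoTIFF, an extension match plus byte matches at particular offsets. A GeoTIFF is (roughly) an ordinary TIFF that carries some extra tags, like GeoKeyDirectoryTag (34735), in its tag directory, and presumably the byte match at offset 186 is finding one of those tag entries. Here’s a minimal sketch of that style of check in Python; it’s not siegfried’s actual code, just an illustration of the kind of rule a person can read, reason about, and fix:

import struct

# GeoTIFF-specific TIFF tags, per the GeoTIFF specification
GEO_TAGS = {
    33550: "ModelPixelScaleTag",
    33922: "ModelTiepointTag",
    34264: "ModelTransformationTag",
    34735: "GeoKeyDirectoryTag",
}

def geotiff_tags(path):
    data = open(path, "rb").read()
    # TIFF header: byte order ("II" little-endian, "MM" big-endian),
    # the magic number 42, then the offset of the first tag directory (IFD)
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        return []
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        return []
    found = []
    while offset:
        (count,) = struct.unpack(endian + "H", data[offset:offset + 2])
        for i in range(count):
            entry = offset + 2 + 12 * i
            (tag,) = struct.unpack(endian + "H", data[entry:entry + 2])
            if tag in GEO_TAGS:
                found.append(GEO_TAGS[tag])
        # the last 4 bytes of an IFD point at the next IFD (0 means stop)
        (offset,) = struct.unpack(
            endian + "I", data[offset + 2 + 12 * count:offset + 6 + 12 * count])
    return found

print(geotiff_tags("MCE_AF2G_2010.tif"))

The PRONOM signature siegfried actually uses is doing something more careful than this, but the point stands: it’s a rule you can inspect and, if need be, correct.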

When you install the Python version of magika you get a few files added to your Python environment:

├── __init__.py
├── cli
│   └── magika.py
├── colors.py
├── config
│   ├── content_types_config.json
│   └── magika_config.json
├── content_types.py
├── logger.py
├── magika.py
├── models
│   └── standard_v1
│       ├── model.onnx
│       ├── model_config.json
│       ├── model_output_overwrite_map.json
│       └── thresholds.json
├── prediction_mode.py
├── strenum.py
└── types.py

The model.onnx file is what magika uses to determine what format a file is in. It was generated by developers at Google using large amounts of data that they have, because they’re Google, and (presumably) a compute cluster, because they’re Google. It’s a binary blob that you can’t meaningfully edit or read as a human.
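
You can load model.onnx with the onnx Python package and dump its structure, but what comes back is a graph of operators and tensors full of weights, not anything resembling a rule a person could read or adjust. A rough sketch, using the path inside the installed package shown above:

import onnx

# load the model that ships with the magika Python package
model = onnx.load("models/standard_v1/model.onnx")

# prints the computation graph: layers, operators, and tensor shapes --
# the wiring and the weights, but nothing like a human-readable rule
print(onnx.helper.printable_graph(model.graph))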

file and siegfried, on the other hand, use hand-crafted databases of “magic codes” to look for in order to determine what format a file is likely to be. Lots of time and effort have gone into creating and maintaining them. The rules aren’t perfect, or complete, but we do know how to read, update, and fix them.
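
For example, the TIFF identification that file did above comes from plain-text rules in its magic database. From memory they look roughly like this (the real entries are longer and go on to parse the tag directory), and anyone can read them, or edit them when they’re wrong:

0	string	II\x2a\x00	TIFF image data, little-endian
!:mime	image/tiff
0	string	MM\x00\x2a	TIFF image data, big-endian
!:mime	image/tiff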

We don’t yet know all the details about how Google built their model, but it’s quite unlikely that the dataset they used will be made publicly available. I guess it’s possible (go on Google, I dare you), but even if they released it you would need (potentially a lot of) compute resources to run the training yourself. This means that not just anyone can rebuild this “open source” tool. The model data is available, but the means of creating it is not. It’s basically like making binary executables available without the source code.

If you find a new file format, or notice that a file format isn’t being recognized correctly, you won’t be able to fix it, because fixing it involves retraining the machine learning model on an augmented dataset, which you don’t have access to.

Funnily enough, siegfried and file have no idea what the file format for model.onnx is. But magika says it’s a “Python compiled bytecode (executable)”. After experimenting a little bit it does seem like magika is able to distinguish programming language source code and executables quite a bit better than the traditional tools.

So perhaps the reality is that it’s useful to have multiple perspectives on file formats, and that running multiple tools has its uses. However, the digital preservation community should be careful not to throw the baby out with the bathwater. It’s important that we are able to maintain our tools, and to understand why they behave the way they do.
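
If nothing else, it’s easy to line the tools up next to each other and see where they agree and disagree. A minimal sketch (the magika API attributes here are based on the version I had installed, and may differ in other releases):

import subprocess
import sys
from pathlib import Path

from magika import Magika

path = Path(sys.argv[1])

# magika's model-based guess, via its Python API
print("magika:", Magika().identify_path(path).output.ct_label)

# file's brief (-b) one-line answer from its magic database
print("file:  ", subprocess.run(["file", "-b", str(path)],
                                capture_output=True, text=True).stdout.strip())

# siegfried's PRONOM-based identification (YAML output, as above)
print(subprocess.run(["sf", str(path)], capture_output=True, text=True).stdout)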


Update 2024-02-22: V pointed out to me that it should be possible to fine-tune the magika model without requiring access to the original corpus it was trained on, or the compute infrastructure that was used. This sounds promising, and I would actually really like to be proven wrong here. But I remain concerned that while fine-tuning might be achievable, adding new file formats could prove difficult, or at least beyond the ken of digital preservationists. I’m not against learning new things (this old dog can still be taught new tricks), but replacing the domain expertise in people’s brains with what’s in ML engineers’ brains is a real transfer of power that is underway at the moment. Perhaps it has been underway for decades under the guise of automation more generally (not just machine learning), but that is a topic for another post…