[TFX Study] 3. Transform

steelbear 2023. 10. 31. 01:25

This time I worked with the Transform component. The goal is for Transform to take the examples produced by ExampleGen and tokenize them with a Hugging Face Transformers tokenizer.

First, create transform.py, which defines the preprocessing_fn() function.

import tensorflow as tf
import tfx.v1 as tfx
from transformers import AutoTokenizer


MODEL_NAME = "beomi/KoAlpaca-llama-1-7b"

TEXT_FEATURES = {
    'en': None,
    'ko': None,
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def preprocessing_fn(inputs):
    outputs = {}

    en = str(inputs['en'])
    ko = str(inputs['ko'])

    print(type(en), en)
    print(type(ko), ko)

    outputs['en_xf'] = tokenizer(en,
                                 padding=True,
                                 truncation=True,
                                 max_length=512,
                                 return_tensors="tf",
                                 )
    outputs['ko_xf'] = tokenizer(ko,
                                 padding=True,
                                 truncation=True,
                                 max_length=512,
                                 return_tensors="tf",
                                 )

    return outputs

The tokenizer is KoAlpaca's, loaded through AutoTokenizer. I chose it because I expected a single KoAlpaca tokenizer to handle both Korean and English.

The preprocessing_fn() defined above is then loaded by the Transform component.

import os

from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath('transform.py'),
    )

context.run(transform)

However, running it this way raised the following error.
| TypeError: Expected Tensor, SparseTensor, RaggedTensor or Operation got {'input_ids': [<tf.Tensor 'Identity_22:0' shape=() dtype=int32>, <tf.Tensor 'Identity_23:0' shape=() dtype=int32>, <tf.Tensor 'Identity_24:0' shape=() dtype=int32>, <tf.Tensor 'Identity_25:0' shape=() dtype=int32>, ...

It looks like a type problem, but I couldn't determine exactly which line of code raised it.
I plan to fix this next week.
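
One possible reading of the error message, offered as a starting point and not a confirmed diagnosis: every value returned from preprocessing_fn() is expected to be a Tensor, SparseTensor, or RaggedTensor, but the tokenizer call returns a dict-like object (with keys such as input_ids), so outputs['en_xf'] ends up holding a mapping instead of a tensor. The sketch below is plain Python with no TFX dependency. The helper name flatten_outputs and the stand-in values are hypothetical and used only for illustration. It shows only how a nested result could be flattened into one key per feature; it does not address whether the tokenizer can run inside preprocessing_fn() at all.

```python
from collections.abc import Mapping


def flatten_outputs(nested, sep="_"):
    """Flatten {'en_xf': {'input_ids': x}} into {'en_xf_input_ids': x}.

    Hypothetical helper: any mapping-like value is expanded so that every
    top-level value in the result is a leaf, not another mapping.
    """
    flat = {}
    for name, value in nested.items():
        if isinstance(value, Mapping):
            # Recurse into the nested mapping and prefix its keys.
            for inner_name, inner_value in flatten_outputs(value, sep).items():
                flat[f"{name}{sep}{inner_name}"] = inner_value
        else:
            flat[name] = value
    return flat


# Stand-in values: plain lists play the role of tensors here.
nested_outputs = {
    "en_xf": {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]},
    "ko_xf": {"input_ids": [4, 5], "attention_mask": [1, 1]},
}

flat_outputs = flatten_outputs(nested_outputs)
print(sorted(flat_outputs))
```

With this shape, each key such as en_xf_input_ids maps directly to a single value, which is the layout the error message seems to be asking for.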
