If you are reading this, you probably come from the SaCoCo tutorial, you have access to a CQPweb installation as administrator, and you want to encode the corpus. We document below two approaches:
The first approach is more involved, but gives you much more control and freedom. The second may be helpful, especially if you are a beginner and the annotation of your corpus is fairly basic.
Let’s assume that you have:
- cqp installed on your computer
The first thing we need to do is to encode the corpus. This process creates a number of files that make it possible to query the corpus with the CQP language.
Once we have the texts in VRT format, encoding the corpus for the CWB is relatively easy.
Check that you have the Corpus Workbench (CWB) installed on your computer; if not, download it and follow these instructions. We compiled version 3.4.8 from source.
Now, run the following commands:
# create the target folder for encoded data
mkdir -p data/cqp/data
# run the command
cwb-encode -c utf8 -d data/cqp/data -F data/contemporary/meta/ -F data/historical/meta -R data/cqp/sacoco -xsB -S text:0+id+collection+source+year+decade+period+title -S p:0 -S s:0 -P pos -P lemma -P norm
# generate the registry file
cwb-make -r data/cqp -V SACOCO
The cwb-encode parameters explained:
- -c to declare the character encoding
- -d path to the target directory where the output will be stored
- -F path to the input directory where the VRT files are located
- -R path to the registry file
- -xsB
  - x for XML compatibility mode (recognises default entities and skips comments as well as an XML declaration)
  - s to skip blank lines in the input
  - B to strip whitespace from tokens
- -S to declare a structural attribute, example: -S text:0+id+authors/
  - text, structural attribute to be declared
  - 0 embedding levels
  - id will be an attribute of text containing some value
- -P to declare positional attributes
Get extensive information on how to encode corpora for the CWB in the encoding tutorial.
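To make the structure of the call easier to see, here is a minimal Python sketch that assembles the same cwb-encode command from its parts. The helper name `build_encode_command` and the `S_ATTRS` mapping are hypothetical and only serve as illustration; the flags and values are the ones used in the command above. The sketch only builds and prints the command, it does not run it.

```python
import shlex


def build_encode_command(data_dir, input_dirs, registry_file,
                         s_attrs, p_attrs, charset="utf8"):
    """Assemble the cwb-encode argument list used in this tutorial.

    Hypothetical helper: it only composes the arguments, nothing is executed.
    """
    cmd = ["cwb-encode", "-c", charset, "-d", data_dir]
    for folder in input_dirs:
        cmd += ["-F", folder]            # one -F per input folder with VRT files
    cmd += ["-R", registry_file, "-xsB"]
    for name, (level, annotations) in s_attrs.items():
        # structural attribute: name:embedding_level+annotation+annotation...
        spec = "+".join([f"{name}:{level}", *annotations])
        cmd += ["-S", spec]
    for name in p_attrs:
        cmd += ["-P", name]              # one -P per positional attribute
    return cmd


# Structural attributes as declared in the command above.
S_ATTRS = {
    "text": (0, ["id", "collection", "source", "year",
                 "decade", "period", "title"]),
    "p": (0, []),
    "s": (0, []),
}

command = build_encode_command(
    "data/cqp/data",
    ["data/contemporary/meta/", "data/historical/meta"],
    "data/cqp/sacoco",
    S_ATTRS,
    ["pos", "lemma", "norm"],
)
print(shlex.join(command))
```

If you wanted to execute the command from Python, the resulting list could be passed to `subprocess.run(command, check=True)`, assuming cwb-encode is on your PATH.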
TIP: for development/testing purposes, just run the command below on the test files.
# create the target folder for encoded data
mkdir -p test/cqp/data
# run the command
cwb-encode -c utf8 -d test/cqp/data -F test/contemporary/meta/ -F test/historical/meta -R test/cqp/sacoco -xsB -S text:0+id+collection+source+year+decade+period+title -S p:0 -S s:0 -P pos -P lemma -P norm
# generate the registry file
cwb-make -r test/cqp -V SACOCO
Once you have the encoded data, you have to upload the files to the server where CQPweb is installed. In our case, this is the machine fedora.clarin-d.uni-saarland.de.
On this machine, one needs to connect to the server as the root user. There are different methods to upload the files:
- scp or rsync, which use the ssh protocol
Upload the local folder data/cqp/sacoco/ to the remote folder (on the server) /data2/cqpweb/indexed, and the registry file data/cqp/sacoco to the folder /data2/cqpweb/registry.
Once all files are uploaded, you have to check the ownership of the folder/file:
- user: wwwrun
- group: www
If the ownership differs, just run a couple of commands:
chown -R wwwrun:www /data2/cqpweb/indexed/sacoco
chown wwwrun:www /data2/cqpweb/registry/sacoco
Then, modify the registry file /data2/cqpweb/registry/sacoco so that it points to the location of the corpus on the server, /data2/cqpweb/indexed/sacoco.
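The registry edit can be done by hand in any text editor. As a minimal sketch, assuming the registry file stores the data directory on a line starting with `HOME` and the info file on a line starting with `INFO` (which is how cwb-encode normally writes it, but do check your own file), the change could also be scripted. The function name `relocate_registry` is hypothetical.

```python
from pathlib import Path


def relocate_registry(registry_path, new_home):
    """Point the HOME and INFO lines of a CWB registry file at new_home.

    Assumption: the data directory is declared on a line whose first
    token is HOME, and the info file on a line whose first token is INFO.
    All other lines are written back unchanged.
    """
    path = Path(registry_path)
    updated = []
    for line in path.read_text(encoding="utf-8").splitlines():
        tokens = line.split(None, 1)
        if tokens and tokens[0] == "HOME":
            line = f"HOME {new_home}"
        elif tokens and tokens[0] == "INFO":
            line = f"INFO {new_home}/.info"
        updated.append(line)
    path.write_text("\n".join(updated) + "\n", encoding="utf-8")


# Example call on the server (paths taken from this tutorial):
# relocate_registry("/data2/cqpweb/registry/sacoco",
#                   "/data2/cqpweb/indexed/sacoco")
```

It is worth keeping a backup copy of the registry file before rewriting it in place.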
Go to admin control panel in the left-hand menu Account actions.
We can now start installing the corpus:
- Click Install a new corpus in the left menu Corpora
- Click the link Click here to install a corpus you have already indexed in CWB. which you will find in the grey row at the top of the page.
- sacoco
- Saarbrücken Cookbook Corpus
- sacoco
- Click Install corpus with settings above that you will find at the bottom of the page.
A new page will load:
- Click Design and insert a text-metadata table for the corpus
A new page will load:
- Select sacoco.meta in section Choose the file containing the metadata
- Describe the contents of the file you have selected, providing for Handle and Description:
  - collection as the primary category.
  - title as free text
- Select Yes please in section Do you want to automatically run frequency-list setup?
- Click install metadata table using the settings above
Now set up the annotation (positional attributes):
- Click Manage annotation, you will find it in the left menu, in section Admin Tools.
- Go!
- Go!
- Go!
- pos as Primary annotation above
- Click Update annotation settings.
Check corpus settings:
- Corpus settings in Admin tools
General options:
- The corpus is currently in the following category: Historical corpora
- Update button
- Update button
We set the access to this corpus open for everybody:
- Admin Control Panel in Admin tools
- Manage privileges in Users and privileges
- sacoco from list Generate default privileges for corpus...
- Generate default privileges for this corpus.
- Manage group grants in Users and privileges
  - Grant new privilege to group
  - everybody
  - Normal access privilege for corpus [sacoco]
  - Grant privilege to group!
Hurraaaaah! Corpus ready to be queried!
Let’s assume that you have administrator access to a CQPweb installation. We will guide you in the following lines through the process of setting up the corpus.
We need a single XML file containing all texts. texts2corpus.py helps us to ease the task.
texts2corpus.py:
- finds all .vrt files contained in the input folders
- extracts the <text> nodes
- appends the <text> nodes to a parent element called <corpus>
- saves <corpus> as a single XML file
Its usage is pretty simple: just provide the paths to the folders containing the .vrt files with metadata, and the path to the output file:
python3 texts2corpus.py -i data/contemporary/meta data/historical/meta -o data/sacoco.vrt
TIP: for development/testing purposes, if you just run python3 texts2corpus.py, it will work on the testing dataset stored in the test folder.
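To give an idea of what the merging step amounts to, here is a minimal sketch. It is not the actual code of texts2corpus.py: the function name `merge_vrt` is hypothetical, and the sketch assumes that each .vrt file holds one complete `<text>` element and no XML declaration, so the files can simply be concatenated inside a `<corpus>` element without parsing them.

```python
from pathlib import Path


def merge_vrt(input_dirs, output_file):
    """Write all .vrt files found in input_dirs into one <corpus> file.

    Assumption: every .vrt file contains a complete <text> element and
    no XML declaration. Returns the number of files merged.
    """
    vrt_files = []
    for folder in input_dirs:
        # keep the order of the input folders, sort files within each one
        vrt_files.extend(sorted(Path(folder).glob("*.vrt")))
    with open(output_file, "w", encoding="utf-8") as out:
        out.write("<corpus>\n")
        for vrt in vrt_files:
            out.write(vrt.read_text(encoding="utf-8").strip("\n") + "\n")
        out.write("</corpus>\n")
    return len(vrt_files)


# Example call with the paths used in this tutorial:
# merge_vrt(["data/contemporary/meta", "data/historical/meta"],
#           "data/sacoco.vrt")
```

Plain concatenation keeps the one-token-per-line layout of the VRT files untouched, which is why the sketch avoids an XML parser.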
Go to admin control panel in the left-hand menu Account actions.
We need to upload the corpus file (sacoco.vrt) and the metadata file (sacoco.meta).
For each file:
- Click Upload a file in the left menu Uploads.
- Click Choose File; a dialogue window will open, pick the file you want to upload.
- Click Upload File.
We can now start installing the corpus:
- sacoco
- sacoco
- Saarbrücken Cookbook Corpus
- Select files section
  - sacoco.vrt
- S-attributes section
  - Use custom setup
  - p:0
- P-attributes section
  - Use custom setup
  - Primary
- Install corpus with settings above at the bottom of the page.
A new page will load:
- Click Design and insert a text-metadata table for the corpus
A new page will load:
- Select sacoco.meta in section Choose the file containing the metadata
- Describe the contents of the file you have selected, providing for Handle and Description:
- Select Yes please in section Do you want to automatically run frequency-list setup?
- Click install metadata table using the settings above